idnits 2.17.1 

draft-rosenberg-itg-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-26) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 1285 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 541 instances of weird spacing in the document.  Is it really
     formatted ragged-right, rather than justified?

  ** There is 1 instance of too long lines in the document, the longest one
     being 2 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 14 has weird spacing: '...This  document...'

  == Line 15 has weird spacing: '...cuments  of  t...'

  == Line 16 has weird spacing: '... groups  may  ...'

  == Line 20 has weird spacing: '...months  and  m...'

  == Line 22 has weird spacing: '...as  reference ...'

  == (536 more instances...)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (November 26, 1996) is 10013 days in the past.  Is
     this intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  ** Obsolete normative reference: RFC 1889 (ref. '2') (Obsoleted by RFC 3550)

  -- Possible downref: Non-RFC (?) normative reference: ref. '3'

  ** Obsolete normative reference: RFC 1890 (ref. '4') (Obsoleted by RFC 3551)


     Summary: 14 errors (**), 0 flaws (~~), 8 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Engineering Task Force      Audio-Visual Transport WG
2	INTERNET-DRAFT                                    J. Rosenberg
3	                                     Lucent, Bell Laboratories
4	                                                H. Schulzrinne
5	                                           Columbia University
6	                                             November 26, 1996
7	                                         Expires: May 26, 1997

9	     Issues and Options for an Aggregation Service within RTP
10	                     draft-rosenberg-itg-00.txt

12	Status of this Memo

14	This  document is an Internet-Draft. Internet-Drafts  are  working
15	documents  of  the  Internet Engineering Task  Force  (IETF),  its
16	areas,  and its working groups.  Note that other groups  may  also
17	distribute working documents as Internet Drafts.

19	Internet-Drafts  are draft documents valid for a  maximum  of  six
20	months  and  may  be  updated, replaced,  or  obsoleted  by  other
21	documents at any time.  It is inappropriate to use Internet-Drafts
22	as  reference  material or to cite them other  than  as  "work  in
23	progress."

25	To  learn  the current status of any Internet-Draft, please  check
26	the  "1id-abstracts.txt" listing contained in the  Internet-Drafts
27	Shadow   Directories   on  ftp.is.co.za  (Africa),   nic.nordu.net
28	(Europe),  munnari.oz.au (Pacific Rim), ds.internic.net  (US  East
29	Coast), or ftp.isi.edu (US West Coast).

31	Distribution of this document is unlimited.

33	                             Abstract

35	    This  memorandum discusses the issues and options involved
36	    in  the design of a new transport protocol for multiplexed
37	    voice within a single packet. The intended application  is
38	    the interconnection of devices which provide "trunking" or
39	    long  distance  telephone service over the Internet.  Such
40	    devices have many voice connections simultaneously between
41	    them.  Multiplexing them into the same connection improves
42	    on  the  efficiency, enables the use of low bitrate  voice
43	    codecs,  and  improves  scalability.  Options  and  issues
44	    concerning   timestamping,  payload  type  identification,
45	    length   indication,   and  channel   identification   are
46	    discussed. Several possible header formats are identified,
47	    and their efficiencies are compared.

49	This  document  is a product of the Audio-Video Transport  working
50	group  within  the Internet Engineering Task Force.  Comments  are
51	solicited  and should be addressed to the working group's  mailing
52	list at rem-conf@es.net and/or to the author(s).

54	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 1
55	1.   Introduction

57	With  the  tremendous changes in the telecommunications  industry,
58	and  the recent growth of the Internet, there is a new opportunity
59	for  offering  long distance telephony over the Internet.  Such  a
60	service  can  be offered by allowing users to dial a local  access
61	number,  connecting them to a device called an Internet  Telephony
62	Gateway  (ITG).  This device prompts the user  for  a  destination
63	telephone number, and then routes the call over the Internet to  a
64	similar  device  at the local exchange of the destination.  There,
65	the call is completed when the destination ITG dials the end user.
66	The scenario is depicted in Figure 1.

68	 -------           --------                 ----------
69	| Phone | --------| NY ITG |---------------| Internet |
70	 -------           --------                |          |
71	                                           |          |
72	                                           |          |
73	 -------           --------                |          |
74	| Phone | --------| LA ITG |---------------|          |
75	 -------           --------                 ----------

77	               Figure 1: Internet Telephony Gateway
78	In  this  application,  the Internet is used  only  for  the  long
79	distance  portion of the telephone call. Access to the service  is
80	still  via the traditional POTS. Current implementations  of  this
81	service  are using H.323 to set up and tear down a new  connection
82	each  time a user establishes or terminates a call. However, H.323
83	is  the  wrong  protocol for many reasons. First, it  is  far  too
84	complex,  providing for capabilities and features which cannot  be
85	used  because  both endpoints are analog telephones.  Secondly,  a
86	significant  increase in efficiency (in excess  of  30%),  can  be
87	readily  achieved if all of the voice calls between  two  ITG  are
88	multiplexed  into  a single packet, instead of  using  a  separate
89	connection (and thus separate packets) for each. Such multiplexing
90	reduces  overhead  by increasing the effective payload  without  a
91	corresponding  penalty in packetization delay. In  fact,  as  more
92	users  are multiplexed, the payload from a particular user can  be
93	reduced  in  size, or the bitrate reduced, without  an  efficiency
94	penalty.  Furthermore, multiplexing improves scalability.  As  the
95	number  of users increases, the number of packets which arrive  at
96	the  destination  does not increase. This means that  computations
97	which  are per-packet (such as RTCP statistics collecting,  jitter
98	accumulation,  header processing, etc.) do not increase.  The  end
99	result is that multiplexing can simultaneously improve efficiency,
100	reduce  delay, and improve scalability. There are some minor  side

102	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 2
103	benefits  in  addition to these major three. For example,  in  the
104	aggregated  scenario,  when a particular  user  enters  a  silence
105	period, and stops sending data, the flow of packets will not  stop
106	unless  all  of the other users are already in silence (generally,
107	an  unlikely  event). This means that packets continually  arrive,
108	and  that  delay  estimates obtained from  those  packets  can  be
109	continuously  generated. Algorithms for dynamically  adapting  the
110	playout  buffer at the receiver are based on these delay estimates
111	[1],  and can now be reworked to utilize the continuous stream  of
112	delays,  as  opposed  to  relying on the  delays  received  during
113	talkspurts only. The result is likely to be an improvement in both
114	end to end delay and loss performance.

116	In  order to perform such multiplexing, a new Internet protocol is
117	required. This protocol must provide for the transport of multiple
118	real  time  streams within a single IP packet. Since the  intended
119	application  is  real-time, the requirements for timing  recovery,
120	sequencing,  and  payload identification are nearly  identical  to
121	normal  single  user voice. Since RTP was designed to  meet  these
122	requirements  [2], it makes sense to build this  new  multiplexing
123	protocol on top of RTP. In fact, RTP allows for different profiles
124	to  be  defined  for a particular application. The  goal  of  this
125	document  is to define a variety of options for that new  profile,
126	and to compare them.

128	It  is  important to note that this application is similar in  its
129	requirements  to [3], which seeks to multiplex multiple  encodings
130	for a particular user into the same IP packet.

132	2.   Terminology

134	User: One of the individuals who has data within the IP packet.
135	Connection: The point to point RTP session between two ITG's.
136	Channel: A "virtual connection" which is established by allowing a
137	user  to  send  data within a packet. There are many channels  per
138	connection - this represents the multiplexing.
139	Channel Identifier: A number which identifies a channel.
140	Block: The section of the payload of a packet which contains  data
141	for a particular user.

143	3.   Requirements:

145	The  transport protocol must provide, at a minimum, the  following
146	functionality:

148	1.    Delineation.  Data  from different  users  must  be  clearly
149	  delineated.
150	2.   Identification. The channel to which the data belongs must be
151	identified.
152	3.   Variable lengths: The protocol should support variable length
153	  blocks  from  a  particular user. This allows for variable  rate
154	  codecs.
155	4.    Low  overhead: Since the protocol is designed for  low  rate
156	  voice,  it  should  have low overhead. This issue  is  extremely
157	  important. New coders are emerging which can support  near  toll
158	  quality at 8 kbps, and acceptable quality at rates even as low as
159	  4 kbps. It is desirable to support such codecs, as they can reduce

161	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 3
162	  the  cost of providing an ITG service. Furthermore, advances  in
163	  coding technology indicate that it is desirable to send very low
164	  bitrate information (1 kbps or less) during silence periods,  so
165	  that  background  noise can be reproduced well  (as  opposed  to
166	sending  nothing). Support of such rates requires a protocol  with
167	  low overhead.
168	5.    Marker: A general purpose marker bit should be available for
169	  all users within the connection.
170	6.   Payload Identification. The codec in use for each user should
171	  be indicated somehow. It is a requirement to allow for the coding
172	  type to change during the lifetime of  a channel.

174	4.   Issues
175	The following section identifies a number of issues which have  an
176	impact on the design of the protocol. It also identifies a variety
177	of options for providing the specific services of the protocol.

179	4.1  How to bind telephone numbers to channel identifiers

181	There  are  four  options for this problem. First,  the  telephone
182	number  can  be  included  in  the per-user  header.  Second,  the
183	telephone  number  can be signaled reliably  by  a  companion  TCP
184	connection  before data begins. Thirdly, the phone number  can  be
185	sent  periodically in RTCP in a soft-state fashion. Fourthly,  the
186	information  can  be sent periodically over a reliable  TCP  based
187	control  channel.  The first approach avoids  any  synchronization
188	problems,  but has high overhead. The second approach  is  a  more
189	traditional  approach, but relies on hard state at the destination
190	ITG.  The third approach allows for a refresh of state, but causes
191	longer  setup  delays  in  the face of  packet  loss.  The  fourth
192	approach  guarantees  reliable delivery of signaling  information,
193	but also generates refreshes to allow for recovery from end-system
194	failures.

196	The  most reasonable approach seems to be the second - the use  of
197	TCP  (or  any  other  reliable  protocol)  for  sending  signaling
198	information.   This   approach  guarantees   that   the   critical
199	information  is  received correctly, and in a  timely  manner.  It
200	avoids bandwidth inefficient refresh as well.

202	4.2  Payload type identification

204	There  are a number of ways to identify the coding of the payload.
205	The  first  is  through static types, identified by  bits  in  the
206	header  (like RTP is now). The second approach dynamically adjusts
207	the  coding  type based on external messages which bind  a  coding
208	type to a channel identifier. Such external messages can be either
209	UDP  or  TCP  based. A related issue is synchronization  of  these
210	changes.  Either the timestamps or sequence numbers can  be  used.
211	One  approach to performing the synchronization is as follows: The
212	source  sends a message reliably to the receiver, indicating  that
213	it  will  change  codings at timestamp N, where N is  some  future
214	timestamp  (or  SN). The N should be chosen far  enough  into  the
215	future  to  guarantee that the receiver will get the  TCP  message
216	before  time N. The farther away N is, the more robust the  system
217	becomes,  but the source also loses its ability to adapt  quickly.
218	There  are  also  several  options for  simple  in-band  signaling
219	methods which can assist in error recovery. This is based  on  the
220	assumption  that it is better for the receiver to  know  that  the

222	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 4
223	encoding  has changed (even though it doesn't know to what),  than
224	to know nothing. This avoids playing garbage out. A one or two bit
225	"coding sequence number" can be used in the header. Such a  number
226	starts  at zero. At the timestamp where the encoding changes,  the
227	SN  increments,  and stays incremented until the next  change.  In
228	this  fashion, we are guaranteed that the source will  never  play
229	out  data  using the wrong coding type. Probably just one  or  two
230	bits of this SN is necessary.

232	Yet  another  approach to changing payload types is  via  "pseudo-
233	dynamic"  payloads.  Before  transmission  of  data  commences,  a
234	reliable  exchange  occurs which downloads  a  table  of  possible
235	encodings  of the payload type, based on the capabilities  of  the
236	source.  The  table then remains active for the  lifetime  of  the
237	connection. This technique can reduce the number of bits  required
238	for  the  payload type, since a particular gateway  is  likely  to
239	support  just  a  few codecs. However, it is still  a  hard  state
240	approach,  but  it  would  only fail in the  face  of  end  system
241	failure, not network failure.

243	Our  conclusion is that it is desirable to have the PTI  field  in
244	the  payload.  This  makes it possible  to  do  more  robust  rate
245	control,   which  becomes  a  significant  issue   when   multiple
246	connections are multiplexed together (and therefore the  aggregate
247	bitrate  increases).  It also makes sense to  signal  a  table  of
248	encodings for the payload type at the beginning of the connection.
249	Any  particular  pair of ITG will generally  only  support  a  few
250	codecs. Therefore, dynamically setting the codings of the PTI  bit
251	makes  a  more compact representation possible without restricting
252	the set of codecs which may be used.

254	4.3  Timestamps
255	Timing is a very complex issue for the multiplexing protocol.  The
256	first  question related to it is whether the protocol will support
257	mixing  of  media  derived from separate clocks (i.e.,  voice  and
258	video). Although doing this seems attractive, it is complex and in
259	opposition  to  the philosophy under which RTP was developed.  RTP
260	explicitly states that separate media should be placed in separate
261	RTP  streams.  This allows for different QoS to be  requested  for
262	each  media, and for clocks to be defined based on the media type.
263	Furthermore,  this  profile is geared towards the  aggregation  of
264	voice  traffic generated from the POTS across the Internet.  As  a
265	result, the only source of data is from a single, 125us clock.

267	The   next  basic  question  is  whether  timestamps  are   needed
268	"globally", i.e., just one per packet independent of the number of
269	users, or "locally", whereby each user within a packet needs their
270	own  timestamp. A separate question is the representation of these
271	timestamps   in  an  efficient  manner.  When  considering   these
272	questions, the criteria to keep in mind are:

274	1.   Can silence periods be recovered correctly
275	2.   Can resynchronization occur in the face of packet loss
276	3.   What is the impact on playout buffering and jitter
277	computation

279	The answer to this question depends on the desired capabilities of

281	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 5
282	the  protocol.  In the most general case, it is possible  to  have
283	different frame sizes for each user (for example, 20ms, 10ms,  and
284	15ms)  within  the  same packet. These frames can  be  arbitrarily
285	aligned  in time with respect to each other (i.e., the 20ms  frame
286	starts  5.3 ms after the beginning of another user's 10 ms frame).
287	The  user can send packets off at any point, containing data  from
288	those  users  whose frames have been generated before  the  packet
289	departure time. A somewhat more restrictive capability is to allow
290	for different frame sizes and time alignments, but to require that
291	any packet contains all the same frame sizes, all aligned in time.
292	The  most restrictive case is to require separate RTP sessions for
293	users  with different frame sizes. This requires a channel  to  be
294	torn  down  and  re-setup when it changes  codec.  The  desire  to
295	perform  flow  control on a channel-by-channel  basis  makes  this
296	approach unacceptable, and it is not considered further.

298	4.3.1     General Case

300	First  consider the general case. Packets can contain frames  from
301	some or all of the users, and those frames are not the same length
302	nor  time  aligned in any way. An example of such  a  scenario  is
303	depicted in Figure 2. In the figure, there are three sources,  and
304	the  ti  correspond to the times of packet emissions. When packets
305	are lost, the variability in the amount and time alignment of data
306	in  each  packet makes it impossible to reconstruct how much  time
307	had  elapsed based solely on sequence numbers (such reconstruction
308	IS  possible in the single user case). Furthermore, the amount  of
309	time  elapsed  can  easily vary from user to user,  and  therefore
310	local timestamps are needed.

312	The general case introduces further complications which have to do
313	with  jitter and delay computation. Such computations  are  needed
314	for  RTCP  reporting  and possibly for the estimation  of  network
315	delays, used in dynamic playout buffers. In the single user  case,
316	the jitter is computed between each packet as:

318	                  D(i,j) = (Rj - Ri) - (Sj - Si)

320	Where  the  Ri  correspond to the reception times at the  receiver
321	measured  in  RTP time, and the Si are the RTP timestamps  in  the
322	data packets. The delay is computed as the difference between  the
323	arrival time at the receiver and generation time, as indicated  by
324	the RTP timestamp.

326	In the multiple user case, these definitions no longer make sense,
327	as  there  is  no single RTP timestamp any longer.  Each  arriving
328	packet will have a single arriving time (Ri), but multiple sending
329	times  (Si,j)  for  each block j in the ith packet.  There  are  a
330	number  of alternatives for delay and jitter computation  in  this
331	case:  compute  such  information  for  all  users,  compute  such
332	information  for  a single user, or generate a  single  delay  and
333	jitter  estimate,  but  have it be based on information  from  all
334	users. There are pros and cons to each approach.

336	First  of  all, it is possible for different blocks to  experience
337	different  delays (and jitters) even though they  are  within  the

339	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 6
340	same  packet.  This  is because the general  scenario  allows  for
341	significant  variability, whereby blocks may either vary  in  size
342	from  packet  to packet and within a packet, or not be transmitted
343	immediately after their completion (the latter happens to source B
344	in  Figure  2). Thus, it is arguable they it may be  desirable  to
345	perform adaptive playout buffering separately for each user, which
346	would require the storage and computation of delays for each user.

348	The second alternative is to compute the delays for a single user,
349	and use that information to size all of the other playout buffers.
350	This  may be sub-optimal in terms of delay and loss, depending  on
351	what fraction of the total delay and jitter are introduced by  the
352	packetization  itself.  There  is a second  disadvantage  to  this
353	approach,  however.  When that particular user  enters  a  silence
354	period,  delay and jitter information is no longer being received,
355	and so estimates of network delay stop adapting. This implies that
356	delay  estimates  will  be old for certain  periods  of  time.  An
357	alternative  is  to change the user from which  delay  and  jitter
358	estimates are being collected.

360	The  third alternative is to compute delay estimates based on some
361	measure   derived  from  all  of  the  users.  There  are  several
362	reasonable  approaches. For example, the  delay  estimate  can  be
363	computed as:

365	                     Delay = max{j, Ri - Si,j}

367	which  would yield a conservative estimate of the delay  for  some
368	users.  This  approach requires storage of only a  single  set  of
369	delay  information,  although computation  still  grows  with  the
370	number of users in a packet.

372	 --------------------------------------------------
373	||               ||               ||               ||
374	 -----------------------------------------------------
375	||       ||       ||       ||       ||       ||       ||
376	 -----------------------------------------------------
377	||         ||         ||         ||         ||         ||
378	 -----------------------------------------------------

380	           t1     t2  t3   t4    t5 t6       t7        t8

382	                Figure 2: Global Timestamp Problem

384	Sending  local timestamps also requires extra bits  in  the  block
385	headers.  It  is possible, however, to use offsets for  the  local
386	timestamps. A global timestamp can be used in the RTP header  (the
387	field  already exists), and each user has a modifier  to  indicate
388	position in time relative to that timestamp.

390	A  related  question  is how big to make the  offset  field.  This
391	offset  is bounded by the difference in time between the  earliest
392	and  latest  samples  within a packet.  Clearly,  this  itself  is
393	bounded  by  the  packetization delay  at  the  source.  For  this
394	application,  if  we  assume  a  125us  sample  clock,  and  bound
395	packetization delays to 100ms, the offset field is bounded by  800
396	ticks, requiring 10 bits.

398	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 7
399	4.3.2     More Restrictive Case

401	As  a  more restrictive case, we allow blocks to be present  in  a
402	packet  if  their frame sizes are identical and aligned  in  time.
403	Note  that this does not imply identical codecs or identical block
404	sizes in terms of bytes; many voice codecs operate with a 20ms  or
405	50ms frame size. This case would allow all frame sizes of the same
406	size and time alignment, independent of the codec, into a packet.

408	This  simplifies the timing issue tremendously. Now, the  scenario
409	is  much  more  like  the  single user application.  The  sequence
410	numbers and the frame size completely determine the timing when at
411	least  one  user is active. But, when all users enter  silence,  a
412	global timestamp is needed to indicate the duration of the silence
413	period.  The  global  timestamp is sufficient to  reconstruct  the
414	timing  in  the face of losses. Therefore, in this  case,  only  a
415	global timestamp is required.

417	It  is  desirable  to support a variety of different  frame  sizes
418	within such an aggregated connection, however. The way to do  this
419	in  this  case  is  to simply mandate that different  packets  can
420	contain  different frame sizes; the only restriction is  within  a
421	packet.  This is not as simple as it may seem at first. Once  this
422	is  done, the relationship between sequence numbers and timing  is
423	lost.  Consider  an example. There are two frame sizes,  10ms  and
424	30ms.  Packet N contains 10ms frames, as does packet N+1 and  N+2,
425	however,  N+3 contains 30ms frames. Thus, although the  difference
426	in  sequence  number between the first and fourth  is  three,  the
427	relative  timing is not 10ms*3 or 30ms*3. Due to  this  fact,  the
428	measurement  of  jitter  is  complicated  (for  the  same  reasons
429	described in Section 4.3.1), as it should not be done between  two
430	packets  with  different  frame  sizes.  It  also  makes  recovery
431	techniques based on sequence number more complex. To resolve  this
432	problem,  we  use  a  natural  concept  in  RTP,  which   is   the
433	synchronization source (SSRC). The approach is to have a  separate
434	SSRC  for  each  frame  size in use. Then,  sequence  numbers  are
435	interpreted  for each SSRC separately. This resolves  the  problem
436	with  the  relationship between timing and sequence numbering.  It
437	also  makes jitter and delay computations simpler - they  are  now
438	done  for each SSRC separately. Furthermore, multiple jitter  (and
439	delay, loss, etc.) values are reported to the source, one for each
440	frame  size.  This  is also desirable, since the  different  frame
441	sizes  will cause different packetization delays and packet sizes,
442	which  may cause those packets to see different delays and  losses
443	in the network than other packets.

445	This  case has both advantages and drawbacks when compared to  the
446	general  case. As an advantage, timing is greatly simplified,  and
447	the  approach  falls much in line with the original intentions  of
448	RTP.  However, it causes losses in efficiency for systems  with  a
449	variety of different frame sizes in operation simultaneously. Such
450	a  situation arises naturally when flow control is applied to each
451	source  individually, as opposed to altering the  rate  and  codec
452	type for all of the active sources.

454	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 8
455	4.4  Channel ID

457	The question of channel identification may seem at first trivial -
458	simply  use a 32 bit number, much like the SSRC, and be done  with
459	it.  However, 32 bits adds significant overhead. Reduction of  the
460	number  of bits for the channel ID becomes a complex issue. Unlike
461	the  single user case, the connection may remain active  for  long
462	periods of time (days or months). The result is that channel  ID's
463	will  need to be reused during the lifetime of the connection.  It
464	is  critical  to ensure that data from different channels  is  not
465	confused  because  of  this. Large channel  ID  spacing  helps  to
466	resolve this issue (although it can not eliminate it), so an added
467	side effect of reducing the number of channel ID's possible is  an
468	increase in the likelihood of such confusion.

470	The  first question to be addressed is how many simultaneous users
471	can one expect to find in a single packet.

473	4.4.1     Number of Users

475	There are several ways to come up with some minimums and maximums.

477	Delay-bound

479	Clearly,  as  we  add  more users, the store  and  forward  delays
480	increase since the packet size gets larger. Therefore, if we bound
481	the  per-hop delay, and provide a lower bound on the codec bitrate
482	and packetization delay, an upper bound on the number of users can
483	be  obtained. Consider a 2.4 kbps codec, with a 20ms  frame  size.
484	This  is  a  reasonable minimum combination. Next,  consider  50ms
485	store  and  forward delays. For a T1, this limits  the  number  of
486	users  within a packet to 965. For a T3, it is 30 times  this,  or
487	nearly 29,000. If silence suppression is used, the number of users
488	within  a  packet is roughly half the number of active  users  (on
489	average),  thus requiring twice as many channel identifiers  (1930
490	and  58,000). This bound doesn't seem to tight. Intuitively,  even
491	965 seems too large.

493	Efficiency bound

495	The  entire purpose of multiplexing is to improve upon efficiency.
496	Therefore, we should be able to support at least as many users  as
497	is necessary to get good efficiency. Consider the typical case,  a
498	16  kbps  codec, with a 20ms packetization delay. This results  in
499	320  bits  of  data per user. If we assume IP/UDP/RTP  (20+8+12=40
500	bytes  =  320 bits), plus an additional word (32 bits) of overhead
501	per user, the efficiency vs. N becomes:

503	                 E = (320N / ((320 + 32)N + 320))

505	This  reaches an asymptote of 90%. It is desirable to be within  a
506	few percent of this, say 88%. Solving for N, this requires 7 users
507	in  a  packet, so that we must support at least 14 active channels
508	(again,  due  to  stat mux). The lower bound,  therefore,  on  the
509	number of users is around 14.

511	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 9
512	MTU Bound

514	In  many  cases, there is a maximum packet size. This  is  usually
515	around  1500 bytes. If we consider a very low bitrate  codec,  the
516	minimum block size from any particular user is 32 bits (otherwise,
517	overheads  become  very large, and we lose word alignment,  so  32
518	bits is a good minimum). Dividing 1500 bytes by 4 bytes, we obtain
519	a  maximum of 375 users. Multiplying by two, the number of  active
520	channels needed is around 750.

522	Based  on these bounds, we need to simultaneously support at least
523	10  users, and at most 750. This would imply that at least 8 to 10
524	bits of channel ID are required.

526	4.4.2     Channel ID Reuse Problem

528	It  is  important to guarantee that data from a particular channel
529	is  never  routed to a different channel; this would mean  that  a
530	user  may  hear pieces of conversations from different  users,  an
531	error  we  consider catastrophic. Such misrouting becomes possible
532	when  a  channel is torn down, and a new channel is  set  up  soon
533	after  using  the same channel ID. Such a scenario is depicted  in
534	Figure 3. Sometime after channel K is torn down, a new channel  is
535	set  up  using the same channel ID, K. If the data packets (dotted
536	lines)  are  being  delayed significantly,  blocks  from  the  old
537	channel  K may still be present in the data stream after  the  new
538	channel K is established. These blocks will then be played out  to
539	the  new  user  of  channel  K.  Protocol  support  is  needed  to
540	guarantee that this can never happen.

542	                    |  Chnl K data here  |
543	                    | .......>           |
544	                    |                    |
545	                    | .......>           |
546	                    |                    |
547	                    |                    |
548	                    |   Teardown K       |
549	                    | --------------->   |
550	                    |                    |
551	                    |   Ack Teardown K   |
552	                    | <---------------   |
553	                    |                    |
554	                    |   Setup K          |
555	                    | --------------->   |
556	                    |                    |
557	                    |   Ack Setup K      |
558	                    | <---------------   |
559	                    |   Recv old Chnl K  |
560	                    |        .........>  |
561	                    |        .........>  |
562	                 Source               Destination

564	                Figure 3: Channel ID Reuse Problem

566	The  solution  lies  in  an  intelligent signaling  protocol.  The
567	protocol  must  support  a  two-way  handshake  for  all   control
568	messages.  In  addition, three simple rules must be  obeyed  at  a

570	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 10
571	source when setting up or tearing down connections:

573	1.   When a source sends a teardown message, it stops sending data
574	  in the UDP stream for that channel. Furthermore, in the signaling
575	  message,  it  indicates the sequence number of the packet  which
576	  contained  the  last block for that channel, call this  sequence
577	  number K.
578	2.    A  source  cannot re-use a channel identifier until  it  has
579	  received an acknowledge from the destination that that particular
580	  channel was successfully torn down.
581	3.   A source cannot send begin to send data from a particular
582	channel in the UDP stream until it has received an acknowledge
583	from the destination that the setup is complete.

585	A few simple rules must also be used at the receiver:

587	1.    When  a  receiver  gets a teardown message,  it  checks  the
588	  highest SN received so far (call this sequence number M). If M >
589	  K,  the  channel is torn down, and any further blocks containing
590	  that channel ID are discarded. If M < K, blocks from that channel
591	  are accepted until the received SN exceeds K. Once this happens,
592	  the channel is torn down and no further blocks with that channel
593	  ID are accepted.
594	2.    When a setup message is received, the destination will begin
595	  to  accept blocks with the given channel identifier, but only if
596	  the sequence numbers of the packets in which they ride is greater
597	  than K.

599	The  use  of the sequence numbers allows the receiver to  separate
600	the  old channel K blocks from the new ones. This guarantees  that
601	the  destination will not misroute packets. An additional  benefit
602	is  that  the end of speech will not be clipped if the  last  data
603	packets  arrive after the teardown is received. This  protocol  is
604	quite  simple  to implement, although it requires a table  at  the
605	receiver of the values of K for each channel ID.

607	Alternate solutions to this reuse problem exist which can  operate
608	when the above restrictions are relaxed. The simplest approach  is
609	to  have  the source keep a linked list of free channel ID's.  The
610	list is initialized to contain all channel ID's, in order. When  a
611	new channel is required to be established, the channel ID is taken
612	from  the top of the list. When a channel is torn down, its ID  is
613	placed  at  the  bottom of the list. This makes the  time  between
614	channel  ID reuse as long as possible, and reduces the probability
615	of  confusion.  With  this method, it is no  longer  necessary  to
616	include  sequence  numbers in the tear down  messages.  Also,  the
617	receiver does not need to maintain a table.

619	4.4.3     Channel ID Coding

621	This  section discusses some of the options for coding the channel
622	ID field.

624	4.4.3.1   Fixed Length

626	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 11
627	The  fixed  length approach is the most straightforward.  A  fixed
628	number  of  bits is assigned to the channel ID. Issues surrounding
629	the number of bits required have been discussed above.

631	4.4.3.2   Implicit + Present Mask

633	In  reality, the channel ID's are very redundant. Both source  and
634	destination  know the set of active connections and their  channel
635	identifiers from the signalling messages. Therefore, if the blocks
636	are  placed in the packet in order of increasing channel ID,  very
637	little  information actually needs to be sent.  In  fact,  without
638	silence suppression, channel activity and the presence of a  block
639	in  a  packet  are  likely  to be equivalent,  in  which  case  NO
640	information actually needs to be sent about channel ID's.

642	Unfortunately, there are some practical problems with this. First,
643	silence suppression is used. Secondly, even if it weren't,  it  is
644	possible for the voice codecs at the ITG not to have their framing
645	synchronized (as in the general case above), so that a packet  may
646	not   contain  data  from  all  users.  Thirdly,  the  source  and
647	destination  do  NOT have a consistent view of the  state  of  the
648	system. There is a delay while signaling messages are in transit.

650	A   few   simple   mechanisms  can  be  used  to  overcome   these
651	complexities.  In the header of the packet, a mask is  sent.  Each
652	bit  in  the mask indicates whether data from a channel is present
653	in  the packet or not. Mapping of channel ID's to bits is done  by
654	sorting  the  channel ID's, and mapping the lowest number  to  the
655	first bit, next lowest to the second, etc. Therefore, if a channel
656	has  no  data for that packet, its bit is set to zero. Given  that
657	the  source  and  destination agree on how  many  connections  are
658	active at all points in time, the number of bits required is known
659	to both sides.

661	The  next  step  is  to  deal with the differences  in  state.  An
662	additional  field, called the "state-number", perhaps 5  bits,  is
663	sent  in the header of the packet. This field starts at zero. Lets
664	say  at  some point in time, its value is N. The source wishes  to
665	tear  down  a  channel.  It sends the tear  down  message  to  the
666	destination,  but continues to send data for that channel  (or  it
667	may  choose to send nothing, but must set the appropriate  bit  in
668	the  mask to zero). When the destination receives the message,  it
669	replies  with an acknowledge. When the acknowledge is received  by
670	the  source,  the source considers the channel torn down,  and  no
671	longer sends data for it, nor considers it in computing the  mask.
672	In  the packet where this happens, the source also increments  the
673	state-number field to N+1. The destination knows that  the  source
674	will  do  this, and will therefore consider the state changed  for
675	all  packets whose value of the field is N+1 or greater. When  the
676	next   signaling  message  takes  effect,  the  field  is  further
677	increased. Even if packets are lost, the value of the state-number
678	field  for  any  correctly received packet  completely  tells  the
679	destination  the  state  of the system as  seen  in  that  packet.
680	Furthermore, it is not necessary to wait for a particular setup or
681	teardown  to  be acknowledged before requesting another  setup  or
682	teardown.

684	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 12
685	The  number of bits for the state-number field should be set large
686	enough to represent the maximum number of state changes which  can
687	have taken effect during a round trip time. As an alternative,  an
688	additional  exchange can occur. After the destination  receives  a
689	packet  with  state number greater than N, it destroys  the  state
690	related  to N, and sends back, reliably, a "free-state N" message,
691	indicating  to  the destination that state N is now  de-allocated,
692	and  can  be  used  again. Until such a message is  received,  the
693	source  cannot reuse state N. This is essentially a  window  based
694	flow  control, where the flow is equal to changes in  state.  With
695	this  addition,  the number of bits for the state  number  can  be
696	safely  reduced,  and it is guaranteed that the  destination  will
697	never confuse the state, independent of the number of state-number
698	bits  used. However, the use of too few state bits can cause  call
699	blocking or delay the teardown of inactive channels.

701	This  problem  in state difference appears to be  similar  to  the
702	channel  ID  reuse  problem described in Section  4.4.2.  However,
703	there is an important difference. In the channel ID reuse problem,
704	if  the  packet containing the last block of a user arrives before
705	the  signaling message tearing down that connection, there  is  no
706	problem. The destination will generally play out silence until the
707	signaling message is received. Here, however, the destination must
708	know  that  blocks  are  no  longer present  in  the  data  stream
709	independent of when the signaling messages arrive.

711	There are some drawbacks to this approach. They require the source
712	and  destination  to maintain state. Any error  in  processing  at
713	either  end,  or  a  hardware failure, causes a complete  loss  of
714	synchronization. This "hard-state" nature of the protocol  can  be
715	relaxed by having the source send the complete state of the system
716	with  each signaling message, along with the "state-number"  field
717	for  which this state takes effect. This guarantees that  even  in
718	the  event  of  end-system  failure,  the  system  state  will  be
719	refreshed  whenever  a  new connection is set  up  or  torn  down.
720	Furthermore,  the  state  can  be  sent  periodically  to  improve
721	performance.

723	4.5  Length Indicators

725	There  are  many ways to actually code the length indicators.  The
726	first  question, however, is the range of lengths  which  must  be
727	coded.

729	4.5.1     Range of Length Indicators

731	Here,   there   is  a  clear  tradeoff  between  flexibility   and

733	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 13
734	efficiency. A larger range can accommodate a variety of  different
735	media  (such  as video) where lengths may be large. However,  this
736	comes  at  the expense of a long length field, which  may  require
737	another  word  of header to hold. For voice, one  would  expect  a
738	maximum  bitrate  to  be  64 kbps, and around  50ms  packetization
739	delay. This yields exactly 100 words of data. Therefore, an  eight
740	bit field is probably sufficient for most voice applications.

742	4.5.2     PTI Based Lengths

744	In  many applications, the amount of data present depends  on  the
745	voice codec in use. Frame based coders will generally send a frame
746	at  a time. Since the codec type is indicated by the PTI field, it
747	may  not  always be necessary to send length information  at  all.
748	Even  for non-frame based codecs, such as PCM, default data  sizes
749	can  be set in the standard (as in RFC 1890 [4]). An extension bit
750	can be used to indicate a non-standard length, so that when set, a
751	length field follows. This allows for efficient coding of the most
752	common   cases,  but  allows  for  variable  lengths  with  little
753	additional cost.

755	4.5.3     Variable Length w/ Indicator

757	In  this  approach, a variable length header is used. All  of  the
758	length indicators for all of the blocks are placed together in the
759	beginning  of  the packet. However, the first four  bits  of  this
760	header  field  indicate the number of bits used  for  each  length
761	field.  What follows are the length fields themselves, each  using
762	the number of bits indicated by the first four bits. This approach
763	scales  well,  using a small overhead when the block  lengths  are
764	small, and a larger overhead when they are larger. The drawback is
765	a  variable length header field, plus additional complexity in the
766	parsing. An example of this technique is depicted in Figure 4.  In
767	the  first  example, the four bit indicator field has a  value  of
768	three, so that the length fields are all three bits long. The four
769	lengths  are then 2,6,3, and 8. In the second example, the  4  bit
770	indicator  has a value of two, so that the length fields  are  all
771	two bits long. The four lengths are thus 3,2,1, and 3.

773	          Example A:  0011 010 110 011 100
774	          Example B:  0010  11  10  01  11

776	              Figure 4: Variable Length w/ Indicator

778	4.5.4     Remaining Packet Length Based Lengths

780	UDP  always informs RTP of how many bytes are in the payload. This
781	itself restricts the possible length of the first block, since its
782	length  must  be less than the total packet length minus  the  RTP
783	header. Furthermore, as each block is placed into the packet,  the
784	possible set of lengths that it can have shrinks - it must  always
785	be  less  than the remaining length in the packet. This  approach,
786	therefore, codes each length field with log2 of the number of bits

788	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 14
789	remaining  in the packet. This approach works extremely well  when
790	there  is a long packet followed by several shorter ones,  whereas
791	the  previous  approach performs poorly in this case. Furthermore,
792	it  eliminates  the  length  indicator  present  in  the  previous
793	approach.  However,  it  is even more complex  than  the  previous
794	technique.  It  can  result in no savings under  some  conditions,
795	especially since the header fields must be rounded to 32 bits.

797	Consider  an  example. The total size of the packet is  31  words.
798	Inside  of it are three blocks, the first whose length is 17,  the
799	second 8, and the third, 6. We would code the length field with  5
800	bits.  After this block is read, the remaining amount of  data  in
801	the  packet is 14 words. Therefore, the next length field is coded
802	with 4 bits. After this block, the remaining amount of data in the
803	packet  is 6 words, so the final length field is coded with  three
804	bits.  The  total  is therefore 5+4+3 = 12 bits. In  the  previous
805	approach  (Section  4.5.3), the entire  length  field  would  have
806	required  4  bits  for the indicator (whose  value  would  be  5),
807	followed by 3 five bit fields, for a total of 19 bits.

809	One  may  question this example since the overhead of  the  length
810	fields  itself  is  not  taken  into account  when  computing  the
811	remaining length of the packet. While this can be incorporated, it
812	makes  things even more complex, and it is not actually necessary.
813	All  that  is  required is that the length fields are  coded  with
814	log2(M),  where  M is any bound on the remaining  amount  of  data
815	which  can be deterministically computed from past information.  A
816	simple  bound  is the packet length minus the data seen  thus  far
817	(one  can  also subtract away any fixed length fields),  precisely
818	the metric used in the example above.

820	4.5.5     Table Based Approach

822	Realistically, most systems will operate with codecs that generate
823	data  in  a  fixed set of lengths (a frame size, for example).  In
824	that  case, the set of lengths which can appear in the packet  are
825	usually  very restricted. To take advantage of this fact, a  table
826	can  be  transmitted to the receiver reliably before  transmission
827	commences. This table can indicate the actual length of  a  block,
828	and  its  coding. The symbols transmitted in the data packets  are
829	then  used in this table to look up the actual lengths.  This  can
830	reduce  the  length field to 2 or 3 bits. These lengths  then  all
831	occur  next to each other in the header. The technique now  relies
832	on  state  at  the  receiver, and the parsing process  is  further
833	complicated by table lookups. In addition, the approach only works
834	if you know the set of lengths before the system begins operation.
835	If  you  allow  the  table  to be dynamically  modified  during  a
836	session,  synchronization problems occur, and the  system  becomes
837	quite complex.

839	Further  gains  can be achieved through the use of  Huffman  codes
840	instead of fixed length codes This only makes sense when different

842	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 15
843	codecs  (and  correspondingly different  lengths)  are  used  with
844	different frequencies. An example of such a situation is when  the
845	codec  changes to a higher rate because of music-on-hold;  a  rare
846	event in general.

848	4.6  Marker Bit

850	The  marker bit has a general functionality, but is normally  used
851	to  indicate  the beginning of a talkspurt. It seems like  a  good
852	idea to include this bit for each user.

854	4.7  Location of Per User Overhead

856	There  will generally be overhead on a per-user basis (information
857	such as channel ID, length, etc.). This information can be located
858	in  one of three places. First, it can all reside in front of  the
859	block  to  which it is applicable. Second, it can  all  be  pasted
860	together  and  reside up front in the header of  the  packet.  The
861	third  is  a  hybrid solution, where some of it resides  up  front
862	(such as channel ID), and some resides in front of the data. There
863	are  various pros and cons to the different approaches. The hybrid
864	approach can be complex, since data is split into multiple places.
865	The  case  where  all  the header is up  front  has  a  few  minor
866	advantages. First, it allows for a complete separation of the data
867	from  the header. The implementation is likely to be a little less
868	complex, since extracting blocks does not require actually  moving
869	through the payload.

871	5.   Options

873	5.1  Option I: Mixer Based

875	This option is the most straightforward to implement, but has  the
876	most  overhead.  The basic premise is to reuse the  mixer  concept
877	introduced in RTP. Each user is considered a contributing  source,
878	and  the gateway is considered a mixer. However, instead of mixing
879	the media, separate data from each user appear in the payload. The
880	32  bit CSRC identifies each user, acting as the channel ID.  Data
881	from each user is organized into blocks. Each block has its own 32
882	bit header, which includes the length (12 bits) in units of 32 bit
883	words,  Marker bit (1b), TimeStamp Offset (12b), and Payload  Type
884	(7b).  Furthermore, the payload type and marker bit  are  stricken
885	from  the RTP header (since they only make sense for an individual
886	user),  and the CC field expanded to fill the missing bytes.  This
887	allows  for  a 12 bit CC field, or 4096 users in a  packet.  Thus,
888	the packet would look like:

890	                        Figure 5: Option I

892	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 16
893	This approach allows for the most amount of generality in terms of
894	variable length coders and coders with different frame sizes  (see
895	Section 4.3.1). The channel ID is longer than necessary, but using
896	the   concept  of  a  contributing  source  for  the  channel   ID
897	necessitates  the  use of the additional bits. There  are  several
898	variations on option I, many of which have been mentioned above:

900	I.A:  Put the CSRC with each 32 bit length+M+PT field, instead  of
901	all  of them being at the beginning. This has some pros and  cons.
902	As  an  interesting  artifact of this  change,  it  is  no  longer
903	necessary  to  have a CC field. The length passed  up  by  UDP  is
904	sufficient  to  recover the point at where you stop  checking  for
905	additional  blocks from users in the payload. In fact, the  length
906	field in the last block is not strictly necessary either.

908	I.B:  Do  the opposite of I.A. Put the length+M+PT field up  front
909	along  with the CSRC fields, with the pattern being CSRC 1, length
910	1, CSRC 2, length 2, etc. Here again, the CC field is not strictly
911	necessary.

913	I.C:  The  CSRC  field can be shrunk to 8 bits.  This  allows  for
914	either 4 or two channel ID's to be coded in the space of one word,
915	whereas only one could in the current size of the field.

917	I.D: The CSRC field can be shrunk to 16 bits.

919	5.2  Option II: One word header

921	This  option eliminates the large channel ID field present in  the
922	previous option. In the RTP header, the CC bit is set to zero, the
923	marker  bit has no meaning, and the payload type is TBD  (possible
924	uses include an indication of the number of blocks in the packet).
925	The  RTP  timestamp  corresponds to the generation  of  the  first
926	sample,  among  all blocks, enclosed in this packet.  A  one  word
927	header precedes each block of data. The number of blocks is  known
928	by  parsing  them until the end of the RTP packet.  The  one  word
929	field  has a channel ID (8 bits), length (8 bits), Marker (1 bit),
930	timestamp offset (11 bits), and payload type (4 bits). Channel  ID
931	number  255  is reserved, and causes the header to be expanded  to
932	allow  for  greater length, payload type, and possibly channel  ID
933	encodings.  The specific format for this expanded  header  is  for
934	further study. Given the compacted payload type space, it may be a
935	good idea to allow negotiation of the meaning for the payload type
936	at the beginning of the connection. It may be worthwhile to expand
937	the length field at the expense of the channel ID - this issue  is
938	for further study.

940	The format of the packet is thus:

942	                        Figure 6: Option II

944	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 17
945	5.3  Option III - Restricted Case

947	Option  II  has  the advantage of being able to  support  multiple
948	frame  sizes  within a single packet. However,  it  comes  at  the
949	expense  of  a 32 bit header (which can be large for  low  bitrate
950	codecs), and at a reduced payload type field. This option has a 16
951	bit  header, but does not support different frame sizes  within  a
952	packet.  It therefore falls into the category described in Section
953	4.3.2. Of the 16 bit header, the first bit is an expand bit (to be
954	described  shortly),  and the second bit is the  marker  bit.  The
955	following  6 bits indicate payload type, and the remaining  8  are
956	for  channel ID. When the expand bit is set, an additional 16 bits
957	are  present, which indicate the length of the block. When  expand
958	is clear, the length is derived from the payload type. Since there
959	is  no timestamp offset, all the blocks in the packet must be time
960	aligned  and  have the same frame lengths. Different sized  frames
961	are supported by using a different SSRC for each frame length (see
962	Section  4.3.2). In the RTP header, the CC field is  always  zero.
963	The  marker  bits  and payload type are undefined.  The  timestamp
964	indicates  the  time  of generation of the first  sample  of  each
965	block.  SSRC  is  randomly chosen, but always different  for  each
966	frame size.

968	The  block headers are all located at the beginning of the packet,
969	and follow each other. If the total length of the fields is not  a
970	multiple of 32 bits, it is padded out to 32. The structure of  the
971	header  is  such that fields never break across packet boundaries.
972	An  example  of such a packet is given in Figure 7.  There  are  7
973	blocks in this example. The first two have standard lengths  based
974	on  the  PT field. The next one uses the expansion bit to indicate
975	the  length. The fourth uses the PT field, the fifth the expansion
976	bit,  and the last two use the PT field. The last 16 bits  of  the
977	header are padded out.

979	                       Figure 7: Option III

981	5.4  Option IV - Stacked RTP

983	This  approach uses a duplicate of the RTP header as the per-block
984	header.  It  is  therefore  extremely inefficient  (12  bytes  per
985	block), but has several advantages: different media types  can  be
986	mixed,  since  the  timestamps are no longer related,  and  little
987	processing is required if the sources being combined came  from  a
988	single  user RTP source. It also works well when one of the  users
989	is  actually a mixer (for example, a conference bridge), since the
990	CSRC  can be used. Its main advantage is the reduction in overhead
991	due  to  the  IP and UDP headers. In addition to the standard  RTP
992	header,  an  additional header is required for length  indication.
993	This header has a number of 16 bit fields, each of which indicates
994	a  length  for its corresponding block (including the 12 byte  RTP
995	header).  The  number of such 16 bit lengths fields  is  known  by
996	continuing  to look for additional length fields until  the  total
997	length of the packet passed up from UDP has been accounted for. If
998	an  odd  number  of  such  length  fields  is  required,  then  an
999	additional  16  bits  of padding is inserted to  make  the  length

1001	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 18
1002	header a multiple of 32 bits.

1004	The format of such a packet is given in  Figure 8.

1006	                        Figure 8: Option IV

1008	5.5  Option V: Compacted

1010	This  option uses the Implicit + Mask approach outlined in Section
1011	4.4.3.2  to  code  the  channel ID. In all other  respects  it  is
1012	similar to Option III. Now, however, the per-block header  can  be
1013	reduced  to one byte: 1 bit of expansion, 1 bit of marker,  and  6
1014	bits  of payload type. Furthermore, the length field (present when
1015	the  expansion bit is set) is reduced to 8 bits from 16 in  Option
1016	III.  This  reduction saves on space, but it also guarantees  that
1017	fields  remain  aligned  on byte boundaries.  The  mask  bits  are
1018	present in the beginning of the packet, and they are preceded by a
1019	8  bit  state-number. If the number of active channels  is  not  a
1020	multiple of 32, the mask field is padded out to a full word.  This
1021	approach  is  extremely efficient, but the channel  identification
1022	procedure  is  more  complex  and  requires  additional  signaling
1023	support.

1025	A  diagram of a typical packet for this option is given in  Figure
1026	9.  The  marker bits are indicated with lowercase m's.  There  are
1027	four active channels, each of which is present in this packet (all
1028	four  mask  bits would then be 1). The first block has a  standard
1029	length, but the second has its expansion bit set, so that an 8 bit
1030	length field follows. The remaining two blocks have normal  8  bit
1031	headers.  The  last 24 bits of the header are  padded  to  a  word
1032	boundary.

1034	                        Figure 9: Option V

1036	6.   Comparison of Options

1038	In  this section, the options are compared in terms of efficiency.
1039	Issues  relating  to complexity, scalability, and generality  have
1040	already  been  discussed in previous sections. The  analysis  here
1041	consists  of  two  parts.  The first is a  table,  indicating  the
1042	efficiency of each option for a variety of speech codecs.  Several
1043	tables  are  included for different numbers of users.  The  second
1044	analysis  consists  of  a  series of  graphs  which  consider  the
1045	efficiency vs. bitrate, assuming a fixed frame size and a  certain
1046	number  of  users. This analysis helps to indicate  the  range  of
1047	codecs which may be reasonably supported with each option.

1049	6.1  Specific Codecs

1051	In  both  Table  1 and Table 2, the efficiency vs. codec  for  all
1052	three options is tabulated. For G.711, G.726, G.728 and G.722, the
1053	frame  size listed is a multiple of the actual frame size  of  the
1054	codec, which is too small to be sent one at a time. The efficiency
1055	is  computed as the number of words of payload such a codec  would

1057	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 19
1058	occupy,  times  the number of users, divided by the  total  packet
1059	size (i.e., it does not consider inefficiencies due to padding the
1060	payload  portion).  Note  that Option  V  is  always  superior  in
1061	efficiency. The efficiencies are generally 1 to 10 percent  apart.
1062	Table  1 considers the case where there are 10 users, and Table  2
1063	considers the case where there are 24.

1065	Codec     Bitrate  FrameSize Opti  Optio Optio Optio Optio  Optio  Optio
1066	         (kbps)   (ms)      on    n I.C n I.D n II  n III  n IV   n V
1067	                          I
1068	G.711          64       20 93.0 94.56 94.12 95.24 96.39 90.50 96.84
1069	                             2%     %     %     %     %     %    %
1070	G.726,         32       20 86.9 89.69 88.89 90.91 93.02 82.64 93.88
1071	                             6%     %     %     %     %     %    %
1072	G.728,         16    18.75 76.9 81.30 80.00 83.33 86.96 70.42 88.47
1073	                             2%     %     %     %     %     %    %
1074	G.729           8       10 50.0 56.60 54.55 60.00 66.67 41.67 69.72
1075	                             0%     %     %     %     %     %    %
1076	G.723         5.3       30 62.5 68.49 66.67 71.43 76.92 54.35 79.33
1077	                             0%     %     %     %     %     %    %
1078	G.723         6.3       30 66.6 72.29 70.59 75.00 80.00 58.82 82.16
1079	                             7%     %     %     %     %     %    %
1080	ITU 4kbps       4       20 50.0 56.60 54.55 60.00 66.67 41.67 69.72
1081	                             0%     %     %     %     %     %    %
1082	G.722          64       15 90.9 92.88 92.31 93.75 95.24 87.72 95.84
1083	                             1%     %     %     %     %     %    %
1084	GSM  Full      13       20 75.0 79.65 78.26 81.82 85.71 68.18 87.35
1085	Rate                         0%     %     %     %     %     %    %
1086	TCH  Half     5.6       20 57.1 63.49 61.54 66.67 72.73 48.78 75.43
1087	Rate                         4%     %     %     %     %     %    %
1088	IS54         7.95       20 62.5 68.49 66.67 71.43 76.92 54.35 79.33
1089	                             0%     %     %     %     %     %    %
1090	IS96          8.5       20 66.6 72.29 70.59 75.00 80.00 58.82 82.16
1091	                             7%     %     %     %     %     %    %
1092	EVRC          8.5       20 66.6 72.29 70.59 75.00 80.00 58.82 82.16
1093	                             7%     %     %     %     %     %    %
1094	PDC  Full     6.7       20 62.5 68.49 66.67 71.43 76.92 54.35 79.33
1095	Rate                         0%     %     %     %     %     %    %
1096	PDC  Half    3.45       40 62.5 68.49 66.67 71.43 76.92 54.35 79.33
1097	Rate                         0%     %     %     %     %     %    %
1098	                         Table 1: 10 Users

1100	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 20
1101	Codec     Bitrat  FrameSize Optio Optio  Optio  Optio  Optio  Optio  Optio
1102	         e       (ms)      n  I  n I.C  n I.D  n II   n III  n IV   n V
1103	         (kbps)
1104	G.711         64       20 94.30 96.00 95.43 96.58 97.76 91.34 98.26
1105	                              %     %     %     %     %     %    %
1106	G.726         32       20 89.22 92.31 91.25 93.39 95.62 84.06 96.57
1107	                              %     %     %     %     %     %    %
1108	G.728         16    18.75 80.54 85.71 83.92 87.59 91.60 72.51 93.37
1109	                              %     %     %     %     %     %    %
1110	G.729          8       10 55.38 64.29 61.02 67.92 76.60 44.17 80.87
1111	                              %     %     %     %     %     %    %
1112	G.723        5.3       30 67.42 75.00 72.29 77.92 84.51 56.87 87.57
1113	                              %     %     %     %     %     %    %
1114	G.723        6.3       30 71.29 78.26 75.79 80.90 86.75 61.28 89.42
1115	                              %     %     %     %     %     %    %
1116	ITU 4kbps      4       20 55.38 64.29 61.02 67.92 76.60 44.17 80.87
1117	                              %     %     %     %     %     %    %
1118	G.722         64       15 92.54 94.74 93.99 95.49 97.04 88.78 97.69
1119	                              %     %     %     %     %     %    %
1120	GSM  Full     13       20 78.83 84.38 82.44 86.40 90.76 70.36 92.69
1121	Rate                          %     %     %     %     %     %    %
1122	TCH  Half    5.6       20 62.34 70.59 67.61 73.85 81.36 51.34 84.93
1123	Rate                          %     %     %     %     %     %    %
1124	IS54        7.95       20 67.42 75.00 72.29 77.92 84.51 56.87 87.57
1125	                              %     %     %     %     %     %    %
1126	IS96         8.5       20 71.29 78.26 75.79 80.90 86.75 61.28 89.42
1127	                              %     %     %     %     %     %    %
1128	EVRC         8.5       20 71.29 78.26 75.79 80.90 86.75 61.28 89.42
1129	                              %     %     %     %     %     %    %
1130	PDC  Full    6.7       20 67.42 75.00 72.29 77.92 84.51 56.87 87.57
1131	Rate                          %     %     %     %     %     %    %
1132	PDC  Half   3.45       40 67.42 75.00 72.29 77.92 84.51 56.87 87.57
1133	Rate                          %     %     %     %     %     %    %
1134	                         Table 2: 24 Users

1136	6.2  Efficiency vs. Bitrate
1137	The  following figure considers the efficiency of the protocol vs.
1138	bitrate. For this case, the frame size is fixed at 20ms,  and  the
1139	number  of  users  at 24. As the bitrate varies,  the  block  size
1140	varies,  and therefore the efficiency does as well. The efficiency
1141	here  is  computed in a slightly different manner than  the  graph
1142	above.  Here, the efficiency is the bitrate times the  frame  size
1143	(without  padding to 32 bits), divided by the same  quantity  plus
1144	the  packet and block overhead. This avoids the otherwise sawtooth
1145	behavior of the graph, which makes it very difficult to read.

1147	The  graph  is very illustrative. The ordering of the efficiencies
1148	is  no  surprise;  option  V  is  always  superior.  However,  the
1149	difference  between  the  options  is  interesting.  Despite   the
1150	difference in overhead by a factor of two, Option V and Option III
1151	are very close in efficiencies over a wide range of bitrates. This
1152	is due to the fact that it requires a lot of users at low bitrates
1153	to   overcome  the  IP/UDP/RTP  header  overhead,  and  at  higher
1154	bitrates,  the  payload  sizes  are  large  enough  to  make   the
1155	difference in block headers inconsequential.

1157	J. Rosenberg, H. Schulzrinne   Expires 5/26/97               Pg. 21
1158	7.   References
1159	_______________________________
1160	[1]  R.  Ramjee, J. Kurose, D. Towsley, H. Schulzrinne,  "Adaptive
1161	Playout Mechanisms for Packetized Audio Applications in Wide  Area
1162	Networks", Proceedings of IEEE Infocom, 1994
1163	[2] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP:  A
1164	Transport  Protocol  for  Real-Time  Applications",  Audio  Visual
1165	Working Group Request for Comments RFC 1889, IETF, January 1996
1166	[3] M. Handley, V. Hardman, I. Kouvelas, C. Perkins, J. Bolot,  A.
1167	Vega-Garcia,   S.  Fosse-Parisis,  "Payload  Format   Issues   for
1168	Redundant Encodings in RTP", Work In Progress
1169	[4]  H.  Schulzrinne, "RTP Profile for Audio and Video Conferences
1170	with  Minimal  Control", Audio Visual Working  Group  Request  for
1171	Comments RFC 1890, IETF, January 1996