idnits 2.17.1 

draft-falk-xcp-spec-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 17.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 1515.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1526.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1533.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1539.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     The X, Delta_Throughput, and RTT fields are unused and SHOULD be
     set to zero.  A router MUST not perform any processing on a minimal
     format header.  This format is intended for use in empty ACK packets, to
     return congestion information from receiver to sender.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 5, 2007) is 6141 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Missing Reference: 'TBD' is mentioned on line 372, but not defined

  == Missing Reference: 'IP2' is mentioned on line 527, but not defined

  == Missing Reference: 'XCP' is mentioned on line 530, but not defined

  == Missing Reference: 'IP1' is mentioned on line 529, but not defined

  == Missing Reference: 'TCP' is mentioned on line 531, but not defined

  -- Obsolete informational reference (is this intentional?): RFC  793
     (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC  813
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 2309
     (Obsoleted by RFC 7567)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 2960
     (Obsoleted by RFC 4960)


     Summary: 1 error (**), 0 flaws (~~), 7 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                            A. Falk
3	Internet-Draft                                               Y. Pryadkin
4	Intended status: Experimental                                        ISI
5	Expires: January 6, 2008                                       D. Katabi
6	                                                                     MIT
7	                                                            July 5, 2007

9	         Specification for the Explicit Control Protocol (XCP)
10	                       draft-falk-xcp-spec-03.txt

12	Status of this Memo

14	   By submitting this Internet-Draft, each author represents that any
15	   applicable patent or other IPR claims of which he or she is aware
16	   have been or will be disclosed, and any of which he or she becomes
17	   aware will be disclosed, in accordance with Section 6 of BCP 79.

19	   Internet-Drafts are working documents of the Internet Engineering
20	   Task Force (IETF), its areas, and its working groups.  Note that
21	   other groups may also distribute working documents as Internet-
22	   Drafts.

24	   Internet-Drafts are draft documents valid for a maximum of six months
25	   and may be updated, replaced, or obsoleted by other documents at any
26	   time.  It is inappropriate to use Internet-Drafts as reference
27	   material or to cite them other than as "work in progress."

29	   The list of current Internet-Drafts can be accessed at
30	   http://www.ietf.org/ietf/1id-abstracts.txt.

32	   The list of Internet-Draft Shadow Directories can be accessed at
33	   http://www.ietf.org/shadow.html.

35	   This Internet-Draft will expire on January 6, 2008.

37	Copyright Notice

39	   Copyright (C) The IETF Trust (2007).

41	Abstract

43	   This document contains an initial specification for the Explicit
44	   Control Protocol (XCP), an experimental congestion control protocol.
45	   XCP is designed to deliver the highest possible end-to-end throughput
46	   over a broad range of network infrastructure, including links with
47	   very large bandwidth-delay products, which are not well served by the
48	   current control algorithms.  XCP is potentially applicable to any
49	   transport protocol, although initial testing has applied it to TCP in
50	   particular.  XCP routers are required to perform a small calculation
51	   on congestion state carried in each data packet.  XCP routers also
52	   periodically recalculate the local parameters required to provide
53	   fairness.  On the other hand, there is no per-flow congestion state
54	   in XCP routers.

56	   XCP is currently not ready for wide-scale deployment on the public
57	   Internet and the intent of this document is to provide a starting
58	   point for future experimentation and development as well as to record
59	   the authors' implementation experiences and caveats.

61	Table of Contents

63	   1.  Changes Since Last Version . . . . . . . . . . . . . . . . . .  4
64	   2.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  5
65	     2.1.  XCP Protocol Overview  . . . . . . . . . . . . . . . . . .  6
66	   3.  The Congestion Header  . . . . . . . . . . . . . . . . . . . . 10
67	     3.1.  Header placement . . . . . . . . . . . . . . . . . . . . . 10
68	     3.2.  Congestion Header Formats  . . . . . . . . . . . . . . . . 10
69	     3.3.  IPsec issues . . . . . . . . . . . . . . . . . . . . . . . 13
70	     3.4.  NAT, middlebox issues  . . . . . . . . . . . . . . . . . . 13
71	     3.5.  MPLS/Tunneling Issues  . . . . . . . . . . . . . . . . . . 14
72	   4.  XCP Functions  . . . . . . . . . . . . . . . . . . . . . . . . 15
73	     4.1.  End-System Functions . . . . . . . . . . . . . . . . . . . 15
74	       4.1.1.  Sending Packets  . . . . . . . . . . . . . . . . . . . 15
75	       4.1.2.  Processing Feedback at the Receiver  . . . . . . . . . 17
76	       4.1.3.  Processing Feedback at the Sender  . . . . . . . . . . 17
77	     4.2.  Router functions . . . . . . . . . . . . . . . . . . . . . 19
78	       4.2.1.  Calculations Upon Packet Arrival . . . . . . . . . . . 20
79	       4.2.2.  Calculations Upon Control Interval Timeout . . . . . . 21
80	       4.2.3.  Calculations Upon Packet Departure . . . . . . . . . . 23
81	       4.2.4.  The Control Interval . . . . . . . . . . . . . . . . . 26
82	       4.2.5.  Obtaining the Persistent Queue . . . . . . . . . . . . 26
83	   5.  Unresolved Issues  . . . . . . . . . . . . . . . . . . . . . . 29
84	     5.1.  XCP With Non-XCP Routers . . . . . . . . . . . . . . . . . 29
85	     5.2.  Variable Rate Links  . . . . . . . . . . . . . . . . . . . 30
86	     5.3.  XCP as a TCP PEP . . . . . . . . . . . . . . . . . . . . . 30
87	     5.4.  Sharing resources between XCP and TCP  . . . . . . . . . . 31
88	     5.5.  A Generalized Router Model . . . . . . . . . . . . . . . . 31
89	     5.6.  Host back-to-back operation  . . . . . . . . . . . . . . . 31
90	   6.  Security Considerations  . . . . . . . . . . . . . . . . . . . 33
91	   7.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 35
92	   8.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 36
93	   9.  Informative References . . . . . . . . . . . . . . . . . . . . 37
94	   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 40
95	   Intellectual Property and Copyright Statements . . . . . . . . . . 41

97	1.  Changes Since Last Version

99	   Changes between version -02 and -03

101	   o  Added text in the abstract and introduction clarifying
102	      experimental nature of the protocol in preparation for ICCRG
103	      review.

105	   o  Re-ordered XCP header fields to allow more cycles for XCP
106	      calculations in lines 20-22 when packets are streamed 32-bits per
107	      cycle.  This change improved performance of an experimental FPGA
108	      implementation.

110	   o  Incremented version number (to 0x03) to reflect the change in the
111	      XCP header

113	   o  Added a subsection on malicious receivers.

115	   o  Moved per-packet divisions from the router to the sender.

117	   Changes between version -01 and -02

119	   o  Minor edits and typo fixes.

121	   Changes between version -00 and -01

123	   o  Updated protocol to reflect movement of the per-packet division
124	      from the router to the end-system.

126	   o  Incremented version number (to 0x02) to reflect change in packet
127	      header format.

129	   o  Reordered the Protocol, Length, Version, and Format fields in the
130	      congestion header (in anticipation of future support of IPv6
131	      extension headers).

133	   o  Routers now MUST (from SHOULD) ignore fields other than
134	      reverse_feedback when minimal header is used.

136	   o  No longer ignore packets with RTT set to zero.  Senders with
137	      coarse-grained timer may generate these if the RTT is less than
138	      the timer precision.

140	   o  Added 'Open Issues' section on variable-rate links.

142	2.  Introduction

144	   The Van Jacobson congestion control algorithms [Jacobson88] [RFC2581]
145	   are used by the Internet transport protocols TCP [RFC0793] and SCTP
146	   [RFC2960].  The Jacobson algorithms are fundamental to stable and
147	   efficient Internet operation, and they have been highly successful
148	   over many orders of magnitude of Internet bandwidth and delay.

150	   However, the Jacobson congestion control algorithms have begun to
151	   reach their limits.  Gigabit-per-second file transfers, lossy
152	   wireless links, and high latency connections are all driving current
153	   TCP congestion control outside of its natural operating regime.  The
154	   resulting performance problems are of great concern for important
155	   network applications.

157	   The original Jacobson algorithm was a purely end-to-end solution,
158	   requiring no congestion-related state in routers.  More recent
159	   modifications have backed off from this purity.  Active queue
160	   management (AQM) in routers (e.g., RED) [RFC2309] improves
161	   performance by keeping queues small, while Explicit Congestion
162	   Notification (ECN) [RFC3168] passes one bit of congestion information
163	   back to senders.  These measures do improve performance, but there is
164	   a limit to how much can be accomplished without more information from
165	   routers.  The requirement of extreme scalability together with
166	   robustness has been a difficult hurdle to accelerating information
167	   flow.

169	   This document concerns the Explicit Control Protocol (XCP) developed
170	   by Dina Katabi of MIT [KHR02].  XCP represents a significant advance
171	   in Internet congestion control: it extracts congestion information
172	   directly from routers, without any per-flow state.  XCP should be
173	   able to deliver the highest possible application performance over a
174	   broad range of network infrastructure, including extremely high speed
175	   and very high delay links that are not well served by the current
176	   control algorithms.  XCP achieves fairness, maximum link utilization,
177	   and efficient use of bandwidth.  XCP is novel in separating the
178	   efficiency and fairness policies of congestion control, enabling
179	   routers to put available capacity to work quickly while
180	   conservatively managing the allocation of capacity to flows.  XCP is
181	   potentially applicable to any transport protocol, although initial
182	   testing has applied it to TCP in particular.

184	   XCP's scalability is built upon the principle of carrying per-flow
185	   congestion state in packets.  XCP packets carry a congestion header
186	   through which the sender requests a desired throughput.  Routers make
187	   a fair per-flow bandwidth allocation without maintaining any per-flow
188	   state.  This enables the sender to learn the bottleneck router's
189	   allocation to a particular flow in a single round trip.

191	   The gains of XCP come with some pain.  XCP is more difficult to
192	   deploy than other proposed Internet congestion control improvements,
193	   since it requires changes in the routers as well as in end systems.
194	   It will be necessary to develop and test XCP with real user traffic
195	   and in real environments, to gain experience with real router and
196	   host implementations and to collect data on performance.  Providing
197	   specifications is an important step towards enabling experimentation
198	   which, in turn, will lead to deployment.  XCP deployment issues will
199	   be addressed in more detail in a subsequent version of this document.

201	   XCP, in its current form, is intended for deployment the public
202	   Internet.  The protocol requires participation by all queues in the
203	   path which could become bottleneck, an unlikely occurrence.
204	   Additionally, interoperability with other congestion control
205	   mechanisms, such as implemented in TCP, remains an open issue and
206	   overshadow potential gains that XCP might provide.  This document
207	   therefore is intended only facilitate controlled experimentation, to
208	   record authors' implementation, and provide a starting point for
209	   future development.

211	   This document contains an initial specification of the protocol and
212	   algorithms used by XCP, as an experimental protocol.  The XCP
213	   algorithms defined here are based upon Katabi's SIGCOMM paper
214	   [KHR02], her MIT thesis [Katabi03], and her ns simulation.  However,
215	   this document includes algorithmic modifications and clarifications
216	   that have arisen from early experience with implementing and testing
217	   XCP at the USC Information Sciences Institute.  (See
218	   http://www.isi.edu/isi-xcp for our project page.)  This document is
219	   intended to provide a baseline for further engineering and testing of
220	   XCP.

222	   This document is organized as follows.  The remainder of Section 2
223	   provides an overview of the XCP protocol, Section 3 discusses the
224	   format of the congestion header, Section 4 describes the functions
225	   occurring in the end-systems and routers, and Section 5 lists some
226	   unresolved issues.

228	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
229	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
230	   document are to be interpreted as described in [RFC2119].

232	2.1.  XCP Protocol Overview

234	   The participants in the XCP protocol include sender hosts, receiver
235	   hosts, and intermediate nodes in which queuing occurs along the path
236	   from the sender to the receiver.  The intermediate nodes are
237	   generally routers, but link-layer switches may also contain packet
238	   queues.

240	   XCP supplies feedback from the network to the sender on the maximum
241	   rate (throughput) for injecting data into the network.  XCP feedback
242	   is acquired through the use of a congestion header on each packet
243	   that is sent.  Routers along the path may update the congestion
244	   header as it moves from the sender to the receiver.  The receiver
245	   copies the network feedback into outbound packets of the same flow.
246	   An end-system may function as both a sender and a receiver in the
247	   case of a bidirectional flow.

249	   The figure below illustrates four entities participating in XCP.  The
250	   sender initializes the congestion header, two routers along the way
251	   may update it, and the receiver copies the feedback from the network
252	   into a returning packet in the same flow.

254	   +----------+        +--------+     +--------+        +----------+
255	   |          |------->| Router |---->| Router |------->|          |
256	   |  Sender  |        +--------+     +--------+        | Receiver |
257	   |          |<----------------------------------------|          |
258	   +----------+                                         +----------+

260	   The congestion header contains four pieces of data:

262	   o  RTT: Set by the sender to its current estimate of the round-trip
263	      time.

265	   o  X: Set by the sender to its current estimate of inter-packet time
266	      gap.  This quantity is used in place of Throughput in earlier
267	      drafts of these specifications to avoid per-packet division in the
268	      router.  See Section 4.1.1 for more about X.

270	   o  Delta_Throughput: Initialized to the amount which the sender would
271	      like to change (increase or decrease) its throughput, and updated
272	      by the routers along the path to be the network's allocated change
273	      in throughput.  This value will be a negative number if an XCP-
274	      capable queue along the path wants the sender to slow down.

276	   o  Reverse_Feedback: When a data packet reaches the receiver, its
277	      Delta_Throughput value is returned to the sender in the
278	      Reverse_Feedback field of a congestion header of a returning
279	      packet (e.g., in an ACK packet).

281	   An XCP-capable router calculates a fair capacity re-allocation for
282	   each packet.  A flow only receives this re-allocation from a
283	   particular router if that router is the bottleneck for that flow.
284	   For XCP, a bottleneck router is defined to be a router that has
285	   insufficient capacity to accept a flow's current or desired
286	   throughput.

288	   Based on current conditions, an XCP-capable router generates positive
289	   or negative feedback each time a packet arrives and compares it the
290	   packet's Delta_Throughput field.  Delta_Throughput is reduced if the
291	   current value exceeds this calculated feedback allocation.  Each XCP-
292	   capable router along the path from sender to receiver performs this
293	   processing.  A packet reaching the receiver therefore contains the
294	   minimal feedback allocation from the network, i.e., the capacity
295	   reallocation from the bottleneck router.

297	   The receiver copies this value into the Reverse_Feedback field of a
298	   returning packet in the same flow (e.g., an ACK or DATA-ACK for TCP)
299	   and, in one round-trip, the sender learns the flow's per-packet
300	   throughput allocation.

302	   The sender uses the reverse feedback information to adjust its
303	   allowed sending rate.  For the transport protocol TCP [RFC0793], for
304	   example, this may be accomplished by adjusting the congestion window,
305	   or cwnd, that limits the amount of unacknowledged data in the
306	   network.  (Cwnd is defined for Van Jacobson congestion control in
307	   [RFC2581].)

309	   Additionally, it is possible to use XCP's explicit notification of
310	   the bottleneck capacity allocation for other types of applications.
311	   For example, XCP may be implemented to support multimedia streams
312	   over DCCP [RFC4340] or other transport protocols.

314	   An XCP-capable router maintains two control algorithms on each output
315	   port: a congestion controller and a fairness controller.  The
316	   congestion controller is responsible for making maximal use of the
317	   outbound link while at the same time draining any standing queues.
318	   The fairness controller is responsible for fairly allocating
319	   bandwidth to flows sharing the link.  These two algorithms are
320	   executed only periodically, at an interval known as the "control
321	   interval".  The algorithms defined below set this interval to the RTT
322	   averaged across all flows.  Further work on choosing an appropriate
323	   value for the control interval may be required.

325	   Each port-specific instance of XCP is independent of every other, and
326	   references to an "XCP router" should be considered an instance of XCP
327	   running on a particular output port.

329	      Actually, it is an oversimplification to say that congestion in
330	      routers only appears at output ports.  Routers are complex devices
331	      which may experience resource contention in many forms and
332	      locations.  Correctly expressing congestion which doesn't occur at
333	      the router output port is a topic for further study.  Even so, it
334	      is important to correctly identify where the queue will build up
335	      in a router.  The XCP algorithm will drain a standing queue;
336	      however it is necessary to measure that queue in order for correct
337	      operation.  For more discussion of this issue see Section
338	      Section 5.5.

340	   More context, analysis, and background can be found in [KHR02] and
341	   [Katabi03].

343	3.  The Congestion Header

345	   The congestion control data required for XCP are placed in a new
346	   header which is called the Congestion Header.

348	3.1.  Header placement

350	   The Congestion Header is located between the IP and transport
351	   headers.  This is consistent with the fact that XCP is neither hop-
352	   by-hop communication -- as in IP -- nor end-to-end communication --
353	   as in TCP or UDP -- but is rather end-system-to-network
354	   communication.  It should allow a router to "easily" locate the
355	   congestion header on a packet with no IP options.

357	      Other choices were considered for header location.  For example,
358	      making the Congestion Header a TCP option was suggested.  This
359	      made sense as the congestion information is related to the
360	      transport protocol.  However, it requires that routers be aware of
361	      the header format for every new transport protocol that might ever
362	      use XCP.  This seemed like an unreasonable burden to place on the
363	      routers and would impede deployment of new transport protocols
364	      and/or XCP.

366	      It has also been suggested that the Congestion Header be an IPv4-
367	      style option.  While this proposal is transport protocol
368	      independent, it would generally force XCP packets to take the slow
369	      path on non-XCP routers along a path.  This could severely impact
370	      performance.

372	   The XCP protocol uses protocol number [TBD], assigned by IANA.  IP
373	   packets containing XCP headers will use this protocol number in the
374	   IP header's Protocol field [RFC0791] to indicate to routers and end-
375	   systems that an XCP congestion header follows the IP header.

377	3.2.  Congestion Header Formats

379	   This section defines the XCP Congestion Header formats.  This holds
380	   for IPv4; the corresponding header for IPv6 is a subject for further
381	   study.

383	   XCP-capable IPv4 packets carry the following Congestion Header:

385	     0                   1                   2                   3
386	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
387	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
388	    |   Protocol    |    Length     |Version|Format |    unused     |
389	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
390	    |                               X                               |
391	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
392	    |                              RTT                              |
393	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
394	    |                        Reverse_Feedback                       |
395	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
396	    |                        Delta_Throughput                       |
397	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

399	   Protocol: 8 bits

401	      This field indicates the next-level protocol used in the data
402	      portion of the packet.  The values for various protocols are
403	      specified by IANA.

405	   Length: 8 bits

407	      This field indicates the length of the congestion header, measured
408	      in bytes.  Length in this version of XCP will always be 20 bytes,
409	      or 0x14.

411	   Version: 4 bits

413	      This field indicates the version of XCP that is in use.  The
414	      version of XCP described in this document corresponds to a value
415	      of 0x03.  Future values will be assigned by IANA
416	      (http://www.iana.org).  See note in the IANA Considerations
417	      section.

419	   Format: 4 bits

421	      This field contains a code to indicate the congestion header
422	      format.  The current format codes are defined below.

424	                        +-----------------+------+
425	                        |      Format     | Code |
426	                        +-----------------+------+
427	                        | Standard Format |  0x1 |
428	                        |                 |      |
429	                        |  Minimal Format |  0x2 |
430	                        +-----------------+------+

432	                                  Table 1

434	   Standard Format

436	      The standard format includes the X, Delta_Throughput, and RTT
437	      fields shown above.  This format is used by XCP in data packets
438	      flowing from sender to receiver.

440	   Minimal Format

442	      The X, Delta_Throughput, and RTT fields are unused and SHOULD be
443	      set to zero.  A router MUST not perform any processing on a
444	      minimal format header.  This format is intended for use in empty
445	      ACK packets, to return congestion information from receiver to
446	      sender.

448	   Other formats may be defined in the future, to define different
449	   representation formats for the X, Delta_Throughput, and/or RTT
450	   fields, for example.  The corresponding format values format values
451	   will be assigned by IANA.  See IANA Considerations section below.

453	   unused: 8 bits

455	      This field is unused and MUST be set to zero in this version of
456	      XCP.

458	   RTT: 32bits

460	      This field indicates the round-trip time measured by the sender,
461	      in fixed point format with 28 bits after the binary point, in
462	      seconds.  Thus, the value of 1 corresponds to 2^(-28) of a second.
463	      This field is an unsigned integer.

465	      The minimum value expressible in this field is 0s.  A value of
466	      zero in the RTT field is legal and indicates that the sender
467	      either does not yet know the round-trip time, or operates at a
468	      coarse-grained timer granularity.  The maximum value expressible
469	      in this field is 15.9999999963 seconds, in steps of 3.7ns.

471	   X: 32 bits

473	      This field indicates the inter-packet time of the flow as
474	      calculated by the sender, in fixed point format with 28 bits after
475	      the binary point, in seconds.  This is the same format as used by
476	      the RTT field.

478	   Delta_Throughput: 32 bits

480	      This field indicates the desired or allocated change in
481	      throughput.  It is set by the sender to indicate the amount by
482	      which the sender would like to adjust its throughput, and it may
483	      be subsequently reduced by routers along the path (See
484	      Section 4.2).  It is measured in bytes per second and is a signed,
485	      2's complement value.

487	      The minimum throughput change expressible in this field is -17
488	      Gbps.  The maximum value expressible in this field is 17 Gbps, in
489	      steps of 8 bits per second.

491	   Reverse_Feedback: 32bits

493	      This field indicates the value of Delta_Throughput received by the
494	      data receiver.  The receiver copies the field Delta_Throughput
495	      into the Reverse_Feedback field of the next outgoing packet in the
496	      same flow.  See Section 4.1.2.

498	3.3.  IPsec issues

500	   IPsec [RFC4301] must be slightly modified to accommodate use of XCP.
501	   The specifications for the IP Authenticated Header (AH) [RFC4302] and
502	   IP Encapsulating Security Payload (ESP) [RFC4303] state that the
503	   IPsec headers immediately follow the IP header.  This would be a
504	   problem for XCP in that a) it would make the XCP headers harder to
505	   find by the routers, b) ESP encryption would make it impossible for
506	   routers along the path to read and write congestion header
507	   information and c) AH authentication would fail if any router along
508	   the path had modified a congestion header.  Therefore, the XCP
509	   congestion header should immediately follow the IP header and precede
510	   any AH.

512	3.4.  NAT, middlebox issues

514	   Middleboxes that attempt to perform actions invisibly on flows must
515	   preserve the congestion header.  Middleboxes that terminate the TCP
516	   connection should terminate the XCP connection.  Middleboxes that
517	   insert queues into the forwarding path should participate in XCP.

519	3.5.  MPLS/Tunneling Issues

521	   When a flow enters an IP tunnel [RFC2003], IPsec ESP tunnel
522	   [RFC4303], or MPLS [RFC3031], network ingress point, the congestion
523	   header should be replicated on the "front" of the outer IP header.
524	   For example, when a packet enters an IP tunnel, the following
525	   transformation should occur:

527	                [IP2]  \_ outer header
528	         __-->  [XCP]  /
529	   [IP1]/       [IP1]  \
530	   [XCP] --->   [XCP]   |_ inner header
531	   [TCP]        [TCP]   |
532	   ...          ...    /

534	   Here the XCP header appended to the front of the outer header is
535	   copied from the inner header, with the appropriate change to the
536	   Protocol field to indicate that the next protocol is IP.

538	   When the packet exits the tunnel, the congestion header, which may
539	   have been modified by routers along the tunneled path, is copied from
540	   the outer header into the inner header.

542	4.  XCP Functions

544	   XCP is concerned with the sender and receiver end-systems and the
545	   routers along the packet's path.  This section describes the XCP-
546	   related algorithms to be implemented in each of these entities.  The
547	   specific case of TCP as transport protocol is also described.  The
548	   emphasis in this section is on explicit and detailed definition of
549	   the XCP algorithms.  The theoretical derivation and analysis of the
550	   algorithms can found in [Katabi03].

552	4.1.  End-System Functions

554	4.1.1.  Sending Packets

556	   The sender is responsible for maintaining five parameters, or their
557	   equivalent: (1) a desired throughput value, (2) a current estimate of
558	   the actual throughput, (3) the maximum throughput allowed by XCP, (4)
559	   a current estimate of inter-packet time, denoted X, and (5) a current
560	   estimate of the round-trip time

562	   A sender may choose to use any reasonable value, i.e., any achievable
563	   value, for desired throughput.  An application may supply this value
564	   via an API, or it might be the speed of the local interface.

566	   When sending a packet, the sender fills in the fields of the
567	   congestion header as follows:

569	   o  The sender sets the RTT field to a scaled smoothed round-trip time
570	      estimate, or to zero if the round-trip time is not yet known.

572	   o  The sender sets the X field to the current inter-packet time
573	      estimate, or to zero if an estimate is not yet available.  Packets
574	      carrying zero X field can receive negative feedback, but not
575	      positive.  Since XCP requires only a single round trip for a flow
576	      to gain an estimate of RTT, this is expected to have negligible
577	      effect.  The value of X may be estimated as the smoothed round-
578	      trip time estimate divided by the number of outstanding packets
579	      (or congestion window size in packets, for window-based
580	      protocols).  Alternatively, it may be derived from the ratio of
581	      the packet size to the current throughput estimate.  Using
582	      instantaneous RTT estimates in the calculation of X may yield
583	      better results than using the smoothed RTT, especially for senders
584	      with coarse-grained timers, i.e., timers with precision less than
585	      the RTT.  This is a subject for further study.  Also,see Section
586	      Section 4.1.3.3 for a discussion of RTT estimation.

588	   o  The sender calculates a desired change (typically, an increase) in
589	      throughput.  This is normally the difference between the current
590	      estimated throughput and the desired throughput.  However, if the
591	      sender does not have sufficient data to send at the current
592	      allowed throughput, the desired change in throughput SHOULD be
593	      zero.

595	   o  The sender then divides the desired throughput change by the
596	      number of packets in one round-trip time, and puts the result in
597	      the Delta_Throughput field of the Congestion Header.  This per-
598	      packet distribution of the throughput change is necessary because
599	      an XCP router does not maintain per-flow congestion state; it must
600	      treat each packet independently of others in the same flow.  The
601	      number of packets in an RTT may be estimated by the product of the
602	      current throughput and the RTT, divided by the Maximum Segment
603	      Size (MSS).

605	      The Delta_Throughput (in bytes/second) can be calculated as:

607	                              desired_throughput - Throughput
608	           Delta_Throughput = -----------------------------------
609	                                      Throughput * ( RTT/MSS )

611	           where:
612	                 desired_throughput is measured in bytes/second
613	                 Throughput is measured in bytes/second
614	                 RTT is measured in seconds
615	                 MSS is measured in bytes

617	      However, Delta_Throughput should be set to zero if, for any
618	      reason, no additional capacity is needed, e.g., there is
619	      insufficient data to maintain Throughput for the next RTT, as
620	      discussed above.

622	   o  An issue for future consideration is how to treat the case when
623	      Delta_Throughput is calculated to be < 1 Bps. Since an integer
624	      representation is passed in the Congestion Header, the result will
625	      appear as zero.  It would be possible to send a fraction of the
626	      packets in a round trip time with non-zero Delta_Throughput
627	      values.

629	   o  For TCP, the throughput estimate can be obtained by dividing the
630	      congestion window cwnd (in bytes) by RTT (in seconds), for
631	      example.  Alternatively, it could be measured.  The ratio cwnd/RTT
632	      differs from true throughput in two respects.  First, cwnd doesn't
633	      account for header size.  This may become significant should XCP
634	      be applied to real-time flows that send large numbers of small
635	      packets, but it is probably not much worry for TCP flows that tend
636	      to use the largest possible packet size.  Second, cwnd represents
637	      permission for the sender to transmit data.  If the application
638	      doesn't use all of the available cwnd, the advertised throughput
639	      will be larger than the true throughput.  This may result in an
640	      XCP router creating an unfair allocation of negative feedback to a
641	      flow.

643	4.1.2.  Processing Feedback at the Receiver

645	   An XCP receiver is responsible for copying the Delta_Throughput data
646	   it sees on arriving packets into the Reverse_Feedback field of
647	   outgoing packets.  In TCP, outgoing packets would normally be ACK-
648	   only segments.

650	   In some cases returning packets are sent less frequently than
651	   arriving packets, e.g., with delayed acknowledgments [RFC1122].  The
652	   receiver is responsible for calculating the sum of the arriving
653	   Delta_Throughput fields for placement in outgoing Reverse_Feedback
654	   fields.

656	4.1.2.1.  Feedback from the Receiver

658	   The receiver end-system returns XCP congestion feedback from the
659	   network to the sender, by copying the Delta_Throughput information
660	   from arriving packets into the Reverse_Feedback field of Congestion
661	   Headers in outgoing packets (possibly aggregating data as described
662	   in Section 4.1.2).

664	   It is possible that even empty ACK packets may create or encounter
665	   congestion in the reverse direction.  Although TCP implementations
666	   generally do not perform congestion-based pacing of empty ACK
667	   segments, some transport protocols (e.g., DCCP) may be.  Such a
668	   transport protocol may choose to use XCP congestion control on the
669	   returning ACKs as well as on the data.

671	   In the normal case of a unidirectional data flow with XCP applied
672	   only to that data flow, the feedback can be sent in a Minimal format
673	   Congestion Header, in which the RTT, X, and Delta_Throughput fields
674	   are set to zero.

676	4.1.3.  Processing Feedback at the Sender

678	   When packets arrive back to the sender carrying reverse feedback, the
679	   sender must adjust its sending rate accordingly.

681	   As noted earlier, this throughput adjustment may be made in TCP by
682	   updating the sender's congestion window, cwnd.  This should use the
683	   formula:

685	           cwnd = max(cwnd + feedback * RTT, MSS)

687	           where:
688	               cwnd     = current congestion window (bytes)
689	               feedback = Reverse_Feedback field from received packet,
690	                          (bytes/sec, may be +/-)
691	               RTT      = Sender's current round-trip time estimate
692	                          (seconds)
693	               MSS      = maximum segment size (bytes)

695	   The value of cwnd has a minimum of MSS to avoid the "Silly Window
696	   Syndrome" [RFC0813].

698	4.1.3.1.  Aging the Allowed Throughput

700	   When a sending application does not send data fast enough to fully
701	   utilize the allowed throughput, XCP should reduce the allowed
702	   throughput as time passes, to avoid sudden bursts of data into the
703	   network if the application starts to send data later.

705	   We present a slight modification of the algorithm for aging the
706	   allowed throughput below.  It is based on Section 4.5 of [Katabi03].
707	   Each RTT in which the sender sends with actual throughput which is
708	   less than the allowed throughput, the allowed throughput MUST be
709	   reduced by the following exponential averaging formula:

711	           Allowed_Throughput = Allowed_Throughput*(1-p) +
712	                                Actual_Throughput * p

714	               where: p is a parameter controlling the speed of aging,
715	                      ranged between 0 and 1.

717	   Using p = 0.5 is suggested.  Consideration of values of p or other
718	   algorithms is a research topic.

720	4.1.3.2.  Response to Packet Loss

722	   When the transport protocol is TCP, a packet drop or detection of an
723	   ECN notification [RFC3168] should trigger a transition to standard
724	   TCP congestion control behavior[RFC2581].  In other words, cwnd
725	   should be halved and Jacobson's fast retransmission/fast recovery,
726	   slow start, and congestion avoidance algorithms should be applied for
727	   the remainder of the connection or until the congestion event is
728	   known to have passed (see Section 5.1 for discussion of alternative
729	   approaches to this issue.  The assumption is that the packet drop
730	   reveals the presence of a congested non-XCP router in the path.
731	   Transitioning to standard TCP behavior is a conservative response.

733	   Note also the following:

735	   o  The change in congestion control algorithm should be delayed until
736	      the three DUPACKs have arrived, according to the Fast
737	      Retransmission/Fast Recovery algorithm[RFC2581].

739	   o  Once the change to standard TCP congestion control has occurred,
740	      cwnd should be managed using the RFC2581 algorithm.

742	   o  The X field in outgoing packets should continue to reflect the
743	      current inter-packet time.  This allows the XCP processes in the
744	      routers along the path to continue to monitor the flow's
745	      utilization.

747	   o  Further study is needed to determine whether it will be possible
748	      to return a connection to XCP congestion control, once it has
749	      transitioned to Van Jacobson mode.

751	   o  For transport protocols other than TCP, the response to a packet
752	      loss or ECN notification is a subject for further study.

754	4.1.3.3.  RTT Estimates

756	   Having a good estimate of the round trip time is more important in
757	   XCP than in Van Jacobson congestion control.  There is evidence that
758	   small errors in the RTT estimate can result in larger errors in the
759	   throughput and X estimates.  The current cwnd divided by SRTT is only
760	   an approximation of the actual throughput.  Likewise, SRTT divided by
761	   cwnd in packets is only an approximation of the highly variable
762	   inter-packet time, X. The RTT used in the ns-2 code in [KHR02] used a
763	   smoothed floating-point RTT estimator, rather than instantaneous
764	   measurements.  Additional research is needed to develop
765	   recommendations for RTT estimation.

767	4.2.  Router functions

769	   The router calculations for XCP are divided into those that occur
770	   upon packet arrival, those that occur upon control interval timeout,
771	   those that occur upon packet departure, and the assessment of the
772	   persistent queue, which uses a separate timer.  The calculations are
773	   presented in the following sections as annotated pseudo-code.

775	4.2.1.  Calculations Upon Packet Arrival

777	   When a packet arrives at a router, several parameters used by XCP
778	   need to be updated.  The steps are described in the following pseudo-
779	   code.

781	   ========================================================
782	   On packet arrival do:

784	   1. input_traffic += Pkt_size

786	   2. sum_x += X

788	   3. if (Rtt < MAX_INTERVAL) then

790	   4.   sum_xrtt += X * Rtt

792	   5. else

794	   6.   sum_xrtt += X * MAX_INTERVAL
795	   ========================================================

797	   Line 1:  The variable input_traffic accumulates the volume of data
798	      that have arrived during a control interval.  When a packet
799	      arrives, the packet size is taken from the IP header and is added
800	      to the ongoing count.

802	   Line 2:  The variable sum_x is used in the control interval
803	      calculation (see equation 4.2 of [Katabi03]) and in capacity
804	      allocation.  For each packet, values of X from the XCP header is
805	      accumulated.  It is recommended that sum_x is stored in a 64-bit
806	      unsigned integer variable.

808	   Lines 3 and 5:  A test is performed to check whether the round trip
809	      time of the flow exceeds the maximum allowable control interval.
810	      If so, MAX_INTERVAL, the maximum allowable control interval, is
811	      used in the subsequent calculations.  Too large a control interval
812	      will delay new flows from acquiring their fair allocation of
813	      capacity.  See Section 4.2.4 for a discussion of the recommended
814	      value for MAX_INTERVAL.

816	   Lines 4 and 6:  As in Line 2, the variable sum_xrtt is used in the
817	      control interval calculation.  It is recommended that it is stored
818	      in a 96-bit unsigned variable.

820	4.2.2.  Calculations Upon Control Interval Timeout

822	   When the control timer expires, several variables need to be updated
823	   as shown below.

825	   Note that several calculations show divisions.  These divisions
826	   should either be accomplished using floating-point arithmetic or
827	   integer arithmetic and appropriate scaling to avoid over- or under-
828	   flow.

830	   ======================================================== On
831	   estimation-control timeout do:

833	    7. avg_rtt = sum_xrtt / sum_x

835	    8. input_bw = input_traffic / ctl_interval

837	    9. F = a * (capacity - input_bw) - b * queue / avg_rtt

839	   10. shuffled_traffic = shuffle_function(...)

841	   11. residue_pos_fbk = shuffled_traffic + max(F,0)

843	   12. residue_neg_fbk = shuffled_traffic + max(-F,0)

845	   13. Cp = residue_pos_fbk / sum_x

847	   14. Cn = residue_neg_fbk / input_traffic

849	   15. input_traffic = 0

851	   16. sum_x = 0

853	   17. sum_xrtt = 0

855	   18. ctl_interval = max(avg_rtt, MIN_INTERVAL)

857	   19. timer.reschedule(ctl_interval)
858	   ========================================================
859	   Line 7:  Update avg_rtt by taking the ratio of the two sums
860	      accumulated in the previous section.  This value is used to
861	      determine the control interval (line 17).

863	   Line 8:  The average bandwidth of arriving traffic is calculated by
864	      dividing the bytes received in the previous control interval by
865	      the duration of the previous control interval.

867	   Line 9:  The aggregate feedback, F, is calculated.  The variable
868	      'capacity' is the ability of the outbound link to carry IP
869	      packets, in bytes/second.  The variable 'avg_rtt' was calculated
870	      in line 7.  The variable 'queue' is the persistent queue and is
871	      defined in section Section 4.2.5.  The values a and b are constant
872	      parameters.  According to [Katabi03], the constant a may be any
873	      positive number such that a < (pi/4*sqrt(2)).  A nominal value of
874	      0.4 is recommended.  The constant b is defined to be b =
875	      a^2*sqrt(2).  (If the nominal value of a is used, the value for b
876	      would be 0.226.)  Note that F may be positive or negative.

878	   Line 10:  This line establishes the amount of capacity that will be
879	      shuffled in the next control interval through the use of the
880	      shuffle_function.  Shuffling takes a small amount of the available
881	      capacity and redistributes it by adding it to both the positive
882	      and negative feedback pools.  This allows new flows to acquire
883	      capacity in a full loaded system.

885	      The recommended shuffle_function is as follows:

887	                    shuffled_traffic = max(0, 0.1 * input_bw - |F|)

889	      The variable 'input_bw' is defined above in Line 8.  Implementers
890	      may choose other functions.  It is important to consider that more
891	      shuffled traffic decreases the time for new flows to acquire
892	      capacity and converge to fairness.  However, too much shuffling
893	      may impede flows from acquiring their fair share of available
894	      capacity.  (For example, consider a setup of N flows bottlenecked
895	      downstream from the given router and another flow, not limited as
896	      those, trying to acquire its fair share.  In this case shuffling
897	      leads to under-utilization of the available bandwidth and impedes
898	      the unlimited flow.)  Shuffled_traffic is always a positive value.

900	   The objective of the feedback calculations is to obtain a per-packet
901	   feedback allocation from the router.  Lines 13 and 14 obtain factors
902	   in this calculation that, unfortunately, have no physical meaning.
903	   One might view them as per-flow capacity allocations that have some
904	   additional processing to prepare them for per-packet allocation.
905	   Note that, with the use of shuffled_traffic, a non-idle router will
906	   always start a control interval with non-zero values for both Cn and
907	   Cp.

909	   Line 11:  The variable 'residue_pos_fbk' keeps track of the pool of
910	      available positive capacity a router has to allocate.  It is
911	      initialized to the positive aggregate feedback.

913	   Line 12:  The variable 'residue_neg_fbk' keeps track of the pool of
914	      available negative capacity a router has to allocate.  It is
915	      initialized to the negative aggregate feedback.  This variable is
916	      always positive.

918	   Line 13:  This line calculates the positive feedback scale factor,
919	      Cp.  The variables residue_pos_fbk, and sum_x are defined above.

921	   Line 14:  This line calculates the negative feedback scale factor,
922	      Cn.  This is a positive value.  The definitions for
923	      residue_neg_fbk, and input_traffic are given above.

925	   Line 15-17:  Reset various counters for the next control interval.

927	   Line 18:  Set the next control interval.  The use of MIN_INTERVAL is
928	      important to establish a reasonable control interval when the
929	      router is idle.

931	   Line 19:  Set timer.

933	4.2.3.  Calculations Upon Packet Departure

935	   An XCP router processes each packet using the feedback parameters
936	   calculated above.  As stated earlier, each packet indicates the
937	   current inter-packet time (X) and a throughput adjustment,
938	   Delta_Throughput.  The router calculates a per-packet capacity change
939	   which will be compared to the Delta_Throughput field in the packet
940	   header.  Using the AIMD rule, positive feedback is applied equally
941	   per-flow, while negative feedback is made proportional to each flow's
942	   capacity.

944	   To accommodate high-speed routers, XCP uses a fixed-point numeric
945	   representation for the Congestion Header fields.  This means that the
946	   per-packet calculations defined below result in residual error that
947	   is less than 1 Bps per packet.  These errors accumulate across all
948	   the packets in a control interval, resulting in an inaccuracy in
949	   XCP's allocation of available bandwidth to flows.  Further work is
950	   needed to understand whether this will be a significant problem and,
951	   if so, whether there is any solution short of using 64 bit precision
952	   or floating point.

954	   Processing should be done according to the pseudo-code below.

956	   ========================================================
957	   On packet departure:

959	   20. pos_fbk = Cp * X

961	   21. neg_fbk = Cn * Pkt_size

963	   22. feedback = pos_fbk - neg_fbk

965	   23. if(Delta_Throughput > feedback) then

967	   24.   Delta_Throughput = feedback

969	   25. else

971	   26.   neg_fbk = min(residue_neg_fbk, neg_fbk +
972	                            (feedback - Delta_Throughput))

974	   27.   pos_fbk = Delta_Throughput + neg_fbk

976	   28. residue_pos_fbk = max(0, residue_pos_fbk - pos_fbk)

978	   29. residue_neg_fbk = max(0, residue_neg_fbk - neg_fbk)

980	   30. if (residue_pos_fbk <= 0) then Cp = 0

982	   31. if (residue_neg_fbk <= 0) then Cn = 0

984	   ========================================================

986	   Line 20:  The contribution of positive feedback for the current
987	      packet is calculated using Cp, defined in line 13, and X (the
988	      flow's advertised inter-packet time) from the Congestion Header.
989	      Note that if Cp (and Cn in Line 21) is implemented as a floating
990	      point number, this calculation would be implemented by multiplying
991	      the Cp-mantissa by the value of X, then shifting the result by the
992	      amount of the Cp-exponent.

994	   Line 21:  The contribution of negative feedback for the current
995	      packet is calculated using Cn, defined in line 14, and Pkt_size
996	      from the IP header.  This value of neg_fbk is positive.

998	   Line 22:  The router's allocated feedback for the packet is the
999	      positive per-packet feedback minus the negative per-packet
1000	      feedback.  This value may be positive or negative.

1002	   Line 23-24:  Line 23 tests whether the packet is requesting greater
1003	      capacity increase (via the packet's Delta_Throughput field) than
1004	      the router has allocated.  If so, this means the the sender's
1005	      desired throughput needs to be reduced to be the router's
1006	      allocation.  In line 24 the Delta_Throughput field in the packet
1007	      header updated with the router feedback allocation.

1009	   Line 25:  This branch is executed when the packet is requesting a
1010	      smaller throughput increase than the router's allocation.  In this
1011	      branch, and the rest of this pseudo-code, the packet header is not
1012	      updated and the remaining code is to correctly update the feedback
1013	      pool variables.

1015	   Line 26:  In this line, the packet's negative feedback contribution,
1016	      neg_fbk, is set to be the smaller of two terms.  The first term,
1017	      residue_neg_fbk, is the pool of negative feedback, i.e., this
1018	      drains the remaining negative feedback in the pool.  The second
1019	      term increases the nominal negative feedback from the router by
1020	      the amount which the Delta_Throughput is less than net router
1021	      allocation.  This allows the router to capture feedback which is
1022	      allocated by an upstream bottleneck.

1024	   Line 27:  The positive allocation, pos_fbk, is adjusted to be the sum
1025	      of Delta_Throughput and neg_fbk, from Line 26.  This is required
1026	      for the sum of pos_fbk and neg_fbk to equal Delta_Throughput.

1028	   Line 28-29:  In these two lines, the feedback pools, residue_pos_fbk
1029	      and residue_neg_fbk, are reduced by the values of pos_fbk and
1030	      neg_fbk accordingly, but prevented from going negative.

1032	   Line 30-31:  When a feedback pool becomes empty, set the scale factor
1033	      to zero, i.e., stop handing out associated feedback.

1035	4.2.4.  The Control Interval

1037	   The capacity allocation algorithm in XCP router updates several
1038	   parameters every Control Interval.  The Control Interval is currently
1039	   defined to be the average RTT of the flows passing through the
1040	   router, i.e., avg_rtt calculated in Line 7 above.  Other possible
1041	   choices for the control interval are under study.

1043	   Notes on avg_rtt:

1045	   o  In this document, the quantity 'avg_rtt' refers to the last
1046	      calculated value.  In other words, the avg_rtt calculated based on
1047	      packets arriving in the previous control interval.

1049	   o  The avg_rtt calculation should ignore packets with an RTT of zero
1050	      in the header.

1052	   o  avg_rtt MUST have a minimum value.  This is to allow flows to
1053	      acquire bandwidth from a previously idle router.  The default
1054	      minimum value, MIN_INTERVAL, should be max(5-10ms, propagation
1055	      delay on attached link).

1057	   o  avg_rtt MUST have a maximum value.  The default maximum value,
1058	      MAX_INTERVAL, should be max(0.5-1 sec, propagation delay on
1059	      attached link).

1061	4.2.5.  Obtaining the Persistent Queue

1063	   In Section 4.2.2 the variable 'queue' contains the persistent queue
1064	   over the control interval.  This is intended to be the minimum
1065	   standing queue over the queue estimation interval.

1067	   The following pseudo-code describes how to obtain the minimum
1068	   persistent queue:

1070	   ========================================================
1071	   On packet departure do:

1073	   32. min_queue = min(min_queue, inst_queue)

1075	   ========================================================

1077	   When the queue-computation timer expires do:

1079	   33. queue = min_queue

1081	   34. min_queue = inst_queue

1083	   35. Tq = max(ALLOWED_QUEUE, (avg_rtt - inst_queue/capacity)/2)

1085	   36. queue_timer.reschedule(Tq)

1087	   ========================================================

1089	   Line 32:  The current instantaneous queue length is checked each time
1090	      a packet departs compute the minimum queue size.

1092	   If avg_rtt is being used as the Control Interval, it MUST NOT be used
1093	   as the interval for measuring the minimum persistent queue.  Doing so
1094	   can result in a feed-forward loop.  For example, if a queue develops
1095	   the average RTT will increase.  If the avg_rtt increases, it takes
1096	   longer to react to the growing queue and the queue gets larger,
1097	   leading to instability.

1099	   Line 33:  Upon expiration of the queue estimation timer, Tq, the
1100	      variable queue, the persistent queue, is set to be the minimum
1101	      queue occupancy over the last Tq.

1103	   Line 34:  Upon expiration of the queue estimation timer, reset the
1104	      running estimate of the minimum queue to be the current queue
1105	      occupancy.

1107	   Line 35:  The first term in the max function, ALLOWED_QUEUE, is the
1108	      time to drain a standing queue that you are willing to tolerate.
1109	      (A nominal value of 2ms worth of queuing is recommended but this
1110	      may be tuned by implementers.)  The second term is an estimate of
1111	      the propagation delay.  In other words the persistent queue is a
1112	      queue that does not drain in a propagation delay. the division by
1113	      2 is a conservative factor to avoid overestimating the propagation
1114	      delay.

1116	   Line 36:  The queue computation timer is set.

1118	5.  Unresolved Issues

1120	   XCP is a work-in-progress.  This section describes some known issues
1121	   that need to be resolved.

1123	5.1.  XCP With Non-XCP Routers

1125	   Obviously, non-XCP routers will exist in networks before XCP becomes
1126	   ubiquitously deployed and we expect other non-XCP systems to continue
1127	   to be in the network indefinitely.  Long term non-XCP network
1128	   elements include any sort of link-level switches with queuing, e.g.
1129	   ATM switches and sophisticated Ethernet switches.  Even simple
1130	   multiplexers are non-XCP queues with very little buffering.

1132	   Sources and the network care about these non-XCP elements because any
1133	   one of them can be a site of network congestion, and if an XCP
1134	   endpoint is bottlenecked at one of these non-XCP elements, no router
1135	   feedback will inform the endpoint to slow down.  If nothing is done,
1136	   such an element will probably collapse under congestion.

1138	   Although exactly how XCP sources will operate in this environment is
1139	   an open issue, a current promising direction is for endpoints to run
1140	   a traditional end-to-end congestion detection algorithm in parallel
1141	   with the XCP algorithm and switch over to using that algorithm for
1142	   control when congestion is detected that XCP is not controlling.  For
1143	   example, an XCP source that detects 3 duplicate acknowledgments would
1144	   fall back to TCP Reno behavior.

1146	   An endpoint that is limited by its end-to-end congestion algorithm
1147	   would indicate so to XCP routers by setting a bit in the packet
1148	   header.  A router may process such packets differently than packets
1149	   from endpoints that are being controlled by XCP.  For example, the
1150	   router might allocate end-to-end controlled packets less feedback or
1151	   not reduce its feedback pools by the full amount when assigning
1152	   feedback to those packets.

1154	   Though using its end-to-end algorithm to control its sending rate, an
1155	   endpoint will also monitor the XCP feedback and if the source
1156	   discovers that the XCP feedback would be more restrictive than the
1157	   end-to-end control over a round trip time, the endpoint will revert
1158	   to following XCP feedback.  XCP feedback that is more restrictive
1159	   over a round trip time is an indication that the endpoint's
1160	   bottleneck is once again at an XCP router and the endpoint should
1161	   take advantage of the more precise XCP information.

1163	   Evaluation of these algorithms is ongoing work

1165	5.2.  Variable Rate Links

1167	   As discussed in [Zhang05], XCP may perform poorly over shared links.
1168	   When a link is shared, such as in CSMA ethernet or 802.11 wireless
1169	   networks, a single queue's drain rate is often a function of the load
1170	   in the shared medium.  So, using a constant value for the variable
1171	   'capacity' in the routing control algorithm may not work well.  For
1172	   correct operation, the XCP router's notion of capacity needs to
1173	   reflect how the link capacity is shared.

1175	5.3.  XCP as a TCP PEP

1177	   In the Internet today TCP performance-enhancing proxies (PEPs) are
1178	   sometimes used to improve application performance over certain
1179	   networks.  TCP PEPs, and the issues surrounding their use are
1180	   described in [RFC3135].  A common mechanism used in TCP PEPs is to
1181	   split a TCP connection into three parts where the first and last run
1182	   TCP and a more aggressive transport protocol is run in the middle
1183	   (across the path which generates poor TCP performance).  This
1184	   improves performance over the "problematic" portion of the path and
1185	   does not require changing the protocol stacks on the end systems.
1186	   For example, if a high-speed satellite link was used to connect a LAN
1187	   to the Internet, a TCP PEP may be placed on either side of the
1188	   satellite link.

1190	   It is not unusual today to find TCP PEPs which, to get high data
1191	   rates, do not use congestion control at all.  Of course, this limits
1192	   the environments in which they can be used.  However, XCP may be used
1193	   in between two TCP PEPs to get high transfer rates and still respond
1194	   to congestion in a correct and scalable way.

1196	   Work on using XCP as a TCP PEP is just beginning [Kapoor05].
1197	   Objectives for such a mechanism would be:

1199	   o  preserve end-to-end TCP semantics as much as possible

1201	   o  enable some form of discovery for one PEP to determine that
1202	      another PEP was in the path

1204	   o  allow for recovery should one half of a PEP pair fail or the route
1205	      change so that one or both PEPs are not on the path

1207	   o  enable aggregation of multiple flows between two PEPs.

1209	   A system which met the above objectives could also be used for
1210	   incremental deployment of XCP.  A network operator could deploy XCP-
1211	   capable routers and use PEPs at the periphery of the network to
1212	   convert from traditional TCP to XCP congestion control.  This may
1213	   result in smaller queues and improved link utilization within the XCP
1214	   network.  (This may be of more interest to wireless network providers
1215	   than to over-provisioned fiber backbones.)

1217	5.4.  Sharing resources between XCP and TCP

1219	   Katabi describes a system for sharing router bandwidth between XCP
1220	   and TCP flows that is based on sharing the output capacity based on
1221	   the average throughputs of XCP and TCP sources using the
1222	   router.[Katabi03] Two queues are installed at the router and they are
1223	   served with different weights by the forwarding engine.  The
1224	   algorithm is work-conserving; the forwarding engine is never idle
1225	   when either queue has packets in it.

1227	   A TFRC-like algorithm estimates the average congestion window of TCP
1228	   sources and the XCP algorithm estimates the average throughput of XCP
1229	   sources.  These averages are used to dynamically weight the time the
1230	   processor spends on each queue.  Initial experiments indicate that
1231	   this system can provide fairness between the two flow classes (XCP/
1232	   TCP).[Katabi03]

1234	   Open issues remain, however; for example there are questions about
1235	   how well the TFRC-like algorithm can estimate TCP throughput with
1236	   only access to local drop rates, convergence time of the weighting
1237	   algorithm has never been explored, and no system for buffer
1238	   allocation to complement the link capacity allocation has been put
1239	   forward.  These open issues are under study.

1241	5.5.  A Generalized Router Model

1243	   The XCP algorithm described here and in [Katabi03] manages congestion
1244	   at a single point in a router, most likely an output queue.  However,
1245	   resource contention can occur at many points in a router.  Input
1246	   queues, backplanes, computational resources can 'congest' in addition
1247	   to output buffers.  There is a need to develop a general model and a
1248	   variety of mechanisms to identify and manage resource contention
1249	   throughout the router.

1251	5.6.  Host back-to-back operation

1253	   XCP hosts should be capable of back-to-back operation, i.e., with no
1254	   router in the path.  Nominally, this should not be a problem.  A
1255	   sender initializes delta_throughput to the desired value, no router
1256	   modifies it and, thus, it is automatically granted.  However, it has
1257	   not yet been decided whether an XCP receiver should be capable of (or
1258	   require) adjusting the delta_throughput to request flow control from
1259	   the receiver to the sender.

1261	   At this point, XCP offers no mechanism for flow control.  (Open
1262	   question: Should it?)  It is believed that running XCP on the output
1263	   queue of a host would solve this problem.  However, it isn't clear
1264	   that the complexity is justified by the need to solve this situation.

1266	6.  Security Considerations

1268	   The presence of a header which may be read and written by entities
1269	   not participating in the end-to-end communication opens some
1270	   potential security vulnerabilities.  This section describes them and
1271	   tries to give enough context so that users can understand the risks.

1273	   Man-in-the-Middle Attacks

1275	      There is a man-in-the-middle attack where a malicious user can
1276	      force a sender to stop sending by inserting negative feedback into
1277	      flow.  This is little different from a malicious user discarding
1278	      packets belonging to a flow using VJ congestion control or setting
1279	      ECN bits.  One question worth investigating further is whether the
1280	      XCP attack is harder to diagnose.

1282	   Covert Data Channels

1284	      IPsec needs to be modified, as discussed in Section 3.3, to allow
1285	      routers to read the entire congestion header and write the
1286	      delta_feedback field.  This could become a covert data-channel,
1287	      i.e., a way in which an end-system can make data viewable to
1288	      observers in the network, on a compromised end-system.

1290	   Malicious Sources

1292	      The XCP algorithms rely on senders to advertise information about
1293	      their current RTT and X and correctly respond to feedback
1294	      delivered from the network.  Naturally, the possibility occurs
1295	      that a sender won't perform these functions correctly.  Chapter 7
1296	      of [Katabi03] and [Katabi04] examine these issues.

1298	      A source which lies about its values of X and hence throughput
1299	      cannot affect the link utilization and, in the worst case, can
1300	      unfairly acquire capacity.  However, this is equivalent to a
1301	      sender opening up multiple TCP flows.  So, there is an incentive
1302	      to lie about X. However, because X is explicitly stated in each
1303	      packet header, it is a simpler matter to police it at the edge of
1304	      the network than, say, for TCP.

1306	      A source which lies about its RTT can disrupt the router control
1307	      algorithm, particularly when a large number of sources lie about
1308	      their RTT and the router control interval is adaptive and uses the
1309	      average RTT.  However, there is little incentive to lie as it will
1310	      not affect the fair allocation of capacity and the liar will
1311	      experience the same degradation as the non-lying flows.  Lying
1312	      about RTT should be considered a weak denial-of-service attack.

1314	      A flow may also ignore negative feedback from the router.  Such a
1315	      flow can obtain unfair throughput in a congested router.  However,
1316	      as with lying, the explicit nature of XCP makes it possible to
1317	      verify that flows are responding to feedback.  For example, a
1318	      policing function in the path (presumably near the edge so that
1319	      the load is manageable and it can be expected to see packet flow
1320	      in both directions) may inspect congestion headers for a flow in
1321	      both directions.  If the policer sees negative feedback heading
1322	      towards a source and no reduction in throughput it may, e.g.,
1323	      punish the flow by severely restricting the throughput.  Note that
1324	      this can be applied on a probabilistic basis, sampling flows only
1325	      occasionally.

1327	   Malicious Receivers

1329	      An XCP receiver is required to return XCP congestion feedback
1330	      unmodified to the sender.  It is possible for a receiver to lie
1331	      about the feedback by sending a greater value than XCP actually
1332	      allocated thus causing the flow to acquire an unfair share of
1333	      bandwidth.  It is worth noting that a similar problem exists in
1334	      TCP [Sherwood05] whereby TCP receivers can ACK data before it
1335	      arrives (optimistic ACKing), however its technical implementation
1336	      in XCP is much easier.

1338	      The explicit nature of XCP though makes it possible to verify that
1339	      the receiver does a truthful job.  For example, a policer at the
1340	      edge can monitor the feedback going in both directions and may
1341	      punish flows whose feedback values have been tampered with;
1342	      another possibility is for a bottleneck router to punish packets
1343	      that report an unfairly large throughput.

1345	7.  IANA Considerations

1347	   XCP requires the assignment of an IP protocol number.  Once this
1348	   value has been assigned, the number may be inserted (by the RFC
1349	   Editor) into Section 3.1 and this paragraph may be removed prior to
1350	   publication.

1352	8.  Acknowledgements

1354	   The authors would like to acknowledge the many contributors who have
1355	   assisted in this work.  Bob Braden applied his usual sage guidance to
1356	   the project and to the spec, in particular.  Ted Faber wrote the
1357	   initial implementation framework and provided much wisdom on kernel
1358	   development and congestion control.  John Wroclawski advised on
1359	   project priorities and strategy.  Eric Coe developed the initial
1360	   implementation and testbed.  Aman Kapoor performed supporting
1361	   simulations and debugged kernel code.  Padma Haldar ported ns-2
1362	   simulation code to ns-2 distribution.  Jasmeet Bagga and Anuraag
1363	   Mittal conducted simulations on various aspects of XCP performance.
1364	   On the XCP mailing list, Tim Shepherd, Tom Henderson, and Matt Mathis
1365	   made valuable contributions to the effort.  To all the above go our
1366	   sincere thanks.

1368	   This material is based upon work supported by the National Science
1369	   Foundation under Grant No. 0230738.

1371	9.  Informative References

1373	   [Jacobson88]
1374	              Jacobson, V., "Congestion Avoidance and Control", ACM
1375	              Computer Communication Review Proceedings of the Sigcomm
1376	              '88 Symposium, August 1988.

1378	   [KHR02]    Katabi, D., Handley, M., and C. Rohr, "Internet Congestion
1379	              Control for Future High Bandwidth-Delay Product
1380	              Environments", ACM Computer Communication
1381	              Review Proceedings of the Sigcomm '02 Symposium,
1382	              August 2002.

1384	   [Kapoor05]
1385	              Kapoor, A., Falk, A., Faber, T., and Y. Pryadkin,
1386	              "Achieving Faster Access to Satellite Link Bandwidth",
1387	              IEEE 8th IEEE Global Internet  Symposium, Miami, FL, March
1388	              2005, 2005.

1390	   [Katabi03]
1391	              Katabi, D., "Decoupling Congestion Control and Bandwidth
1392	              Allocation Policy With Application to High Bandwidth-Delay
1393	              Product Networks", MIT PhD. Thesis, March 2003.

1395	   [Katabi04]
1396	              Katabi, D., "XCP's Performance in the Presence of
1397	              Malicious Flows", Second International Workshop on
1398	              Protocols for Fast Long-Distance Networks, Presentation,
1399	              February 2004.

1401	   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
1402	              September 1981.

1404	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
1405	              RFC 793, September 1981.

1407	   [RFC0813]  Clark, D., "Window and Acknowledgement Strategy in TCP",
1408	              RFC 813, July 1982.

1410	   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
1411	              Communication Layers", STD 3, RFC 1122, October 1989.

1413	   [RFC2003]  Perkins, C., "IP Encapsulation within IP", RFC 2003,
1414	              October 1996.

1416	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1417	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1419	   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
1420	              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
1421	              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
1422	              S., Wroclawski, J., and L. Zhang, "Recommendations on
1423	              Queue Management and Congestion Avoidance in the
1424	              Internet", RFC 2309, April 1998.

1426	   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
1427	              Control", RFC 2581, April 1999.

1429	   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
1430	              Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
1431	              Zhang, L., and V. Paxson, "Stream Control Transmission
1432	              Protocol", RFC 2960, October 2000.

1434	   [RFC3031]  Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol
1435	              Label Switching Architecture", RFC 3031, January 2001.

1437	   [RFC3135]  Border, J., Kojo, M., Griner, J., Montenegro, G., and Z.
1438	              Shelby, "Performance Enhancing Proxies Intended to
1439	              Mitigate Link-Related Degradations", RFC 3135, June 2001.

1441	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
1442	              of Explicit Congestion Notification (ECN) to IP",
1443	              RFC 3168, September 2001.

1445	   [RFC4301]  Kent, S. and K. Seo, "Security Architecture for the
1446	              Internet Protocol", RFC 4301, December 2005.

1448	   [RFC4302]  Kent, S., "IP Authentication Header", RFC 4302,
1449	              December 2005.

1451	   [RFC4303]  Kent, S., "IP Encapsulating Security Payload (ESP)",
1452	              RFC 4303, December 2005.

1454	   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
1455	              Congestion Control Protocol (DCCP)", RFC 4340, March 2006.

1457	   [Sherwood05]
1458	              Sherwood, R., Bhattacharjee, B., and R. Braud,
1459	              "Misbehaving TCP receivers can cause internet-wide
1460	              congestion collapse", ACM Proceedings of the 12th ACM
1461	              Conference on Computer and Communications Security (CCS
1462	              2005)  pp. 383-392, Alexandria, VA, 2005.

1464	   [Zhang05]  Zhang, Y. and T. Henderson, "An Implementation and
1465	              Experimental Study of the eXplicit Control Protocol
1466	              (XCP)", IEEE Proceedings of the 24th IEEE
1467	              International Conference on Computer Communications
1468	              (INFOCOM       2005), pp. 1037-1048, Miami, Florida, Mar
1469	              2005., 2005.

1471	Authors' Addresses

1473	   Aaron Falk
1474	   USC Information Sciences Institute
1475	   4676 Admiralty Way
1476	   Suite 1001
1477	   Marina Del Rey, CA  90292

1479	   Phone: 310-448-9327
1480	   Email: falk@isi.edu
1481	   URI:   http://www.isi.edu/~falk

1483	   Yuri Pryadkin
1484	   USC Information Sciences Institute
1485	   4676 Admiralty Way
1486	   Suite 1001
1487	   Marina Del Rey, CA  90292

1489	   Phone: 310-448-8417
1490	   Email: yuri@isi.edu

1492	   Dina Katabi
1493	   Massachusetts Institute of Technology
1494	   200 Technology Square
1495	   Cambridge, MA  02139

1497	   Phone: 617-324-6027
1498	   Email: dk@mit.edu
1499	   URI:   http://www.ana.lcs.mit.edu/dina/

1501	Full Copyright Statement

1503	   Copyright (C) The IETF Trust (2007).

1505	   This document is subject to the rights, licenses and restrictions
1506	   contained in BCP 78, and except as set forth therein, the authors
1507	   retain all their rights.

1509	   This document and the information contained herein are provided on an
1510	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1511	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
1512	   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
1513	   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
1514	   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1515	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1517	Intellectual Property

1519	   The IETF takes no position regarding the validity or scope of any
1520	   Intellectual Property Rights or other rights that might be claimed to
1521	   pertain to the implementation or use of the technology described in
1522	   this document or the extent to which any license under such rights
1523	   might or might not be available; nor does it represent that it has
1524	   made any independent effort to identify any such rights.  Information
1525	   on the procedures with respect to rights in RFC documents can be
1526	   found in BCP 78 and BCP 79.

1528	   Copies of IPR disclosures made to the IETF Secretariat and any
1529	   assurances of licenses to be made available, or the result of an
1530	   attempt made to obtain a general license or permission for the use of
1531	   such proprietary rights by implementers or users of this
1532	   specification can be obtained from the IETF on-line IPR repository at
1533	   http://www.ietf.org/ipr.

1535	   The IETF invites any interested party to bring to its attention any
1536	   copyrights, patents or patent applications, or other proprietary
1537	   rights that may cover technology that may be required to implement
1538	   this standard.  Please address the information to the IETF at
1539	   ietf-ipr@ietf.org.

1541	Acknowledgment

1543	   Funding for the RFC Editor function is provided by the IETF
1544	   Administrative Support Activity (IASA).