idnits 2.17.1 draft-ietf-rmt-bb-pgmcc-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity. ** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an Abstract section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There is 1 instance of too long lines in the document, the longest one being 2 characters in excess of 72. ** There are 37 instances of lines with control characters in the document. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 640: '...eouts the sender MUST force the electi...' RFC 2119 keyword, line 667: '...N-marked packets, that ACK MUST report...' RFC 2119 keyword, line 789: '.... A PGMCC sender SHOULD compare the re...' RFC 2119 keyword, line 792: '...PGMCC sender SHOULD mark the current a...' RFC 2119 keyword, line 796: '...Also, the sender MAY include some form...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (12 July 2004) is 7228 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Missing reference section? '2' on line 835 looks like a reference -- Missing reference section? '7' on line 853 looks like a reference -- Missing reference section? '6' on line 850 looks like a reference -- Missing reference section? '3' on line 839 looks like a reference -- Missing reference section? '1' on line 832 looks like a reference -- Missing reference section? '4' on line 843 looks like a reference -- Missing reference section? '5' on line 847 looks like a reference Summary: 8 errors (**), 0 flaws (~~), 2 warnings (==), 10 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet Engineering Task Force RMT WG 2 INTERNET-DRAFT Luigi Rizzo/U. Pisa 3 draft-ietf-rmt-bb-pgmcc-03.txt Gianluca Iannaccone/Intel 4 Lorenzo Vicisano/Cisco 5 Mark Handley/UCL 6 12 July 2004 7 Expires: January 2005 9 PGMCC single rate multicast congestion control: 10 Protocol Specification 12 Status of this Document 14 This document is an Internet-Draft and is in full conformance with all 15 provisions of Section 10 of RFC2026. 17 Internet-Drafts are working documents of the Internet Engineering Task 18 Force (IETF), its areas, and its working groups. Note that other groups 19 may also distribute working documents as Internet-Drafts. 21 Internet-Drafts are valid for a maximum of six months and may be 22 updated, replaced, or obsoleted by other documents at any time. It is 23 inappropriate to use Internet-Drafts as reference material or to cite 24 them other than as a "work in progress". 26 The list of current Internet-Drafts can be accessed at 27 http://www.ietf.org/ietf/1id-abstracts.txt 29 To view the list Internet-Draft Shadow Directories, see 30 http://www.ietf.org/shadow.html. 32 This document is a product of the IETF RMT WG. Comments should be 33 addressed to the authors, or the WG's mailing list at rmt@lbl.gov. 35 Abstract 37 This document describes PGMCC, a single rate multicast 38 congestion control scheme which is TCP-friendly and achieves 39 scalability, stability and fast response to variations in 40 network conditions. PGMCC is suitable for both non-reliable 41 and reliable data transfers. It is mainly designed for NAK- 42 based multicast protocols, and uses a window-based, TCP-like 43 control loop using positive ACKs between one representative of 44 the receiver group (the ACKER) and the sender. The ACKER is 45 selected dynamically and may change over time. 47 PGMCC is made of two components: a window-based control loop, 48 which closely mimics TCP behavior, and a fast and low-overhead 49 procedure to select (and track changes of) the ACKER. The 50 scheme is robust to measurement errors, and supports fast 51 response to changes in the receiver set and/or network 52 conditions. 54 Table of Contents 56 1. Introduction. . . . . . . . . . . . . . . . . . . . . . 4 57 1.1. Terminology. . . . . . . . . . . . . . . . . . . . . 4 58 2. Protocol Overview . . . . . . . . . . . . . . . . . . . 4 59 2.1. Packet Contents. . . . . . . . . . . . . . . . . . . 6 60 2.1.1. Data Packets. . . . . . . . . . . . . . . . . . . 6 61 2.1.2. Feedback Packets. . . . . . . . . . . . . . . . . 7 62 2.1.3. Field sizes and formats . . . . . . . . . . . . . 8 63 2.2. Window-based controller. . . . . . . . . . . . . . . 9 64 2.3. Acker Selection. . . . . . . . . . . . . . . . . . . 11 65 2.3.1. Initial Acker election. . . . . . . . . . . . . . 11 66 2.3.2. Acker dropouts. . . . . . . . . . . . . . . . . . 12 67 2.4. TCP Throughput Equation. . . . . . . . . . . . . . . 12 68 2.5. RTT measurement. . . . . . . . . . . . . . . . . . . 13 69 2.5.1. Explicit Timestamp. . . . . . . . . . . . . . . . 13 70 2.5.2. Implicit timestamp. . . . . . . . . . . . . . . . 13 71 2.5.3. Sequence numbers. . . . . . . . . . . . . . . . . 14 72 2.5.4. Recommendations . . . . . . . . . . . . . . . . . 15 73 2.6. Loss rate measurement. . . . . . . . . . . . . . . . 15 74 2.7. Timeouts . . . . . . . . . . . . . . . . . . . . . . 16 75 2.8. Interaction with feedback suppression 76 schemes . . . . . . . . . . . . . . . . . . . . . . . . . 16 77 2.9. Interaction with ECN . . . . . . . . . . . . . . . . 17 78 3. Procedures - Sender . . . . . . . . . . . . . . . . . . 17 79 4. Procedures -- Receiver. . . . . . . . . . . . . . . . . 18 80 5. Security Considerations . . . . . . . . . . . . . . . . 19 81 6. Authors' Addresses. . . . . . . . . . . . . . . . . . . 20 82 7. Acknowledgments . . . . . . . . . . . . . . . . . . . . 20 83 8. Full Copyright Statement. . . . . . . . . . . . . . . . 21 85 1. Introduction 87 This document describes PGMCC, a single rate multicast congestion 88 control scheme which is TCP-friendly and achieves scalability, stability 89 and fast response to variations in network conditions. 91 PGMCC is designed for multicast sessions with one sender and one or more 92 receivers, and is a good match for transport protocols using negative 93 acknowledgements (NAKs) to collect feedback from the receivers. The 94 congestion control scheme implemented by PGMCC closely mimics the 95 congestion-control behavior of TCP, as it uses a window-based control 96 loop which is run between the sender and a selected receiver called the 97 ACKER. The role of the ACKER is to provide timely feedback in the same 98 way as a TCP receiver; additionally, the ACKER is selected dynamically 99 among the receivers as the one which would experience the lowest 100 throughput if separate TCP sessions were run between the sender and each 101 of the receivers. 103 Scalability in PGMCC comes from the use of negative acknowledgements 104 (NAKs) for collecting feedback from receivers other than the ACKER. As 105 a consequence, the usual techniques for NAK suppression and aggregation 106 can be used to reduce the amount of feedback to the source and improve 107 the scalability of the scheme. 109 PGMCC is designed to completely decouple congestion control from data 110 integrity. As a consequence, the scheme can work with both reliable data 111 transfer and unreliable communication protocols such as those used for 112 video or audio streaming. 114 While designed with multicast in mind, PGMCC can be equally used as a 115 replacement for TCP for unicast sessions which require a lower degree of 116 reliability than what TCP offers. 118 1.1. Terminology 120 In this document, the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", 121 "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and 122 "OPTIONAL" are to be interpreted as described in RFC 2119 and indicate 123 requirement levels for compliant PGMCC implementations. 125 2. Protocol Overview 127 PGMCC is based on two separate but complementary mechanisms: 129 o A window-based control loop which closely emulates TCP congestion 130 control. 132 The window-based control loop is simply an adaptation of the TCP 133 congestion control scheme to transport protocols where missing 134 (because of network errors or congestion) data packets are not 135 necessarily retransmitted, and so the congestion control scheme 136 cannot rely on cumulative acknowledgements. In PGMCC, the 137 ``congestion window'' is simulated using a token-based scheme which 138 permits congestion control to be decoupled from retransmission 139 state. One of the receivers in the group operates as the ACKER, i.e. 140 the node in charge of sending positive acknowledgements back to the 141 source and thus controlling the rate of the transfer. 143 o A procedure to select the ACKER. 144 The purpose of this procedure is to make sure that, in presence of 145 multiple receivers, the ACKER is dynamically selected to be the 146 receiver which would have the lowest throughput if separate TCP 147 sessions were run between the sender and each receiver. 148 For the acker selection mechanism, PGMCC uses a throughput equation 149 to determine the expected throughput for a given receiver as a 150 function of the loss rate and round-trip time. Unlike other schemes 151 [2], the TCP throughput equation is not used to determine the actual 152 sending rate, which is completely controlled by the window-based 153 control loop. 155 In principle, PGMCC's congestion control mechanism works as follows: 157 o Receivers measure the loss rate and feed this information back to 158 the sender, either in ACK or NAK messages. 160 o The sender also uses these feedback messages to measure the round- 161 trip time (RTT) to each receiver. 163 o The loss rate and RTT are then fed into PGMCC's throughput equation, 164 to determine the expected throughput to that receiver. 166 o The sender then selects as the acker the receiver with the lowest 167 expected throughput, as computed by the equation. 169 The dynamics of the acker selection mechanism are sensitive to how the 170 measurements are performed and applied. In the rest of this document we 171 suggest specific mechanisms to perform and apply these measurements. 172 Other mechanisms are possible, but it is important to understand how the 173 interactions between mechanisms affect the dynamics of PGMCC. 175 2.1. Packet Contents 177 Before specifying the sender and receiver functionality, we describe the 178 information required by PGMCC to perform its tasks. This information is 179 carried in the data packets sent by the sender, and in the feedback 180 packets sent by the receiver. As PGMCC will be used along with some 181 transport protocol, the actual data and feedback packets will contain 182 further information for use by the protocol itself. For this reason, we 183 do not specify packet formats, as these depend on the details of the 184 transport protocol used. 186 Note that the requirements of the transport protocol in terms of packet 187 generation may differ from those of PGMCC. As an example, most NAK-based 188 reliable multicast protocols do not use positive acknowledgements, but 189 PGMCC requires ACKs for clocking out data packets; unreliable transport 190 protocols might have no interest in generating NAKs for data integrity 191 purposes, yet PGMCC depends on NAKs reaching the data sender in order to 192 elect the ACKER. 194 Implementors may decide to insert PGMCC-related information in already 195 existing protocol packets whenever possible, but in cases such as the 196 ones described in the previous paragraph, it might be necessary to 197 define and generate new packets exclusively for congestion control 198 purposes. As an example, in a prototype implementation of PGMCC on top 199 of the PGM protocol [7], some of the information used by PGMCC is 200 already present in the original protocol packets, and PGMCC-specific 201 information is carried as PGM options in ODATA and NAK packets. However, 202 a new packet type has been defined for ACKs, which are generated 203 according to the rules defined in this document. 205 2.1.1. Data Packets 207 Each data packet sent by the data sender contains the following 208 information: 210 o A SEQUENCE NUMBER. This number is incremented by one for each data 211 packet transmitted. The field must be sufficiently large that it 212 does not wrap causing two different packets with the same sequence 213 number to be in the receiver's recent packet history at the same 214 time. 216 o A TIMESTAMP (or equivalent information, see Section 2.5) indicating 217 when the packet with this sequence number has been sent. There is 218 no requirement for synchronized clocks between the sender and the 219 receivers. The timestamp is used to measure network round-trip 220 times, so needs sufficient resolution for this task. A resolution 221 of 1ms would be adequate. 223 o The ACKER IDENTITY, i.e. the identity of the receiver in charge of 224 sending an acknowledgement for this data packet. The ACKER is 225 elected as a result of the process described in Section 2.3. 226 A special value is used to indicate that no ACKER is designated for 227 this packet -- this can happen at the beginning of a session or when 228 the current ACKER leaves the group. Receivers interpret this value 229 as a request to elect a new acker. 231 2.1.2. Feedback Packets 233 There are two types of feedback packets used by PGMCC: ACK packets and 234 NAK packets. 235 ACK packets are generated by the current ACKER, and are used to detect 236 loss or successful delivery of packets, and to regulate the throughput 237 accordingly. ACK packets also contain information used to determine the 238 TCP-equivalent throughput for the ACKER. 239 NAK packets are sent by any receiver who experiences loss. They contain 240 information used to determine the TCP-equivalent throughput for that 241 receiver. In an actual protocol instantiation (such as PGM [7]), NAK 242 packets might also be used by the protocol to request the retransmission 243 of specific packets, and indicate the identity of the packet being 244 requested. 246 Both ACK and NAK packets are sent by data receivers, and contain the 247 following information: 249 o The TIMESTAMP (or equivalent information) derived from the most 250 recently received data packet according to one of the techniques 251 described in Section 2.5. 252 This value is used by the sender to measure the RTT to the receiver 253 who generated this feedback packet. 255 o ``p'', the receiver's current estimate of the LOSS RATE. The loss 256 rate is measured by receivers as described in Section 2.6 258 In addition to the above, ACK packets (sent by the acker designated in 259 the corresponding data packets) must also contain the following 260 information: 262 o RX_MAX, the highest sequence number among received data packets 263 (taking care to deal with sequence number wrapping correctly). 265 o ACK_BITMAP, a bitmap indicating the receive status of the latest N 266 (typically N=32) data packets with sequence numbers RX_MAX-(N-1) to 267 RX_MAX. 269 This information is used by the sender to record which packets have been 270 received or lost, and manipulate the transmit window accordingly. Note 271 that each ACK packet contains information about multiple packets, and 272 this increases the robustness of the scheme to loss of ACK packets. 273 This is necessary because ACKs are not sent reliably (unlike TCP's ACKs, 274 which are cumulative). 276 2.1.3. Field sizes and formats 278 The following sizes and formats are suggested for the various fields 279 used by PGMCC and transmitted over the network: 281 o SEQUENCE NUMBERS 282 32 bit, unsigned, network order. 284 o TIMESTAMPS 285 32 bit, unsigned, network order. A resolution of 1ms or better is 286 desirable. 288 o ACKER IDENTITY 289 Same size and format of a network layer address (e.g. 32 bit for 290 IPv4). Note though that using an IP address for the Acker Identify 291 will cause problems with NAT traversal. Transport protocol 292 designers might examine the SSRC mechanism used by RTP [6] as an 293 alternative form of node identifier that could be used as Acker 294 Identity. 296 o LOSS RATE (``p'') 297 16-bit unsigned integer, in network format, with 0 indicating no 298 loss and 2^16-1 indicating 100% loss. 300 o ACK BITMAP 301 32-bit, in network format, with least significant bit indicating 302 receive status of packet RX_MAX. 304 2.2. Window-based controller 306 In a window-based congestion control scheme such as TCP, the 307 ``congestion window'' represents, among other things, the maximum amount 308 of packets in flight at any time, which in turn controls the throughput 309 of the session. The sender keeps track of the actual number of packets 310 in flight, basing on its transmissions and the reception of 311 acknowledgements. 313 The sender may dynamically change the size of the window, according to 314 the congestion control scheme being used. In TCP, and PGMCC, an 315 ``Additive Increase Multiplicative Decrease'' (AIMD) scheme is used: in 316 absence of loss, the window is increased by some fixed amount (typically 317 one packet) per round trip time (RTT), whereas upon loss the window is 318 reduced to a fraction of its original value (typically halved) in each 319 RTT in which a loss event is experienced. 321 In PGMCC the window is managed using a token-based mechanism, controlled 322 by two variables: 324 o A ``Window Size'', W, which describes the current window size in 325 packets. 327 o A ``Token Count'', T, which indicates the number of packets that can 328 be transmitted. T is bounded above by W. It is decremented every 329 time a packet is transmitted, and incremented every time an ACK is 330 received, according to the rules below. 332 Note that these two variables need to hold non-integer data. Typically 333 a fixed point representation with at least 16 bits for both integer and 334 fractional parts would be acceptable for implementation purposes. 336 The information contained in each ACK is used to determine how many new 337 packets are acknowledged by that ACK, and whether there are 338 unacknowledged packets which were not reported in previous ACKs. The 339 sender also schedules a timeout to react in case no ACKs are received. 341 The sender behaves as follows: 343 o INITIALIZATION 344 At startup, or after a timeout, both W and T are set to 1. 346 o ACK RECEPTION, NO LOSS DETECTED 347 If the incoming ACK reports new acknowledged packets, and no loss 348 (as defined in the next paragraph) is detected, then the window is 349 inflated by one packet per RTT. 351 NOTE: during the slow-start phase, TCP opens the window 352 exponentially up to the SSTHRESH value, which is computed by TCP 353 according to the dynamics of the session and updated upon losses. 355 We do recommend that PGMCC uses a similar strategy, but using a 356 fixed, small value for SSTHRESH (e.g. 4 packets). In fact, due to 357 the dynamicity of the ACKER, which might change on every single 358 packet, it is hard to compute a reliable estimate of the SSTHRESH 359 without keeping state for multiple receivers, and the benefits are 360 small in any event. 362 In summary, the reaction to ACK reception on no loss modifies T and 363 W as follows (here, N is the number of new packets acknowledged by 364 the incoming ACK): 366 if (W < SSTHRESH) then 367 D = min(N, SSTHRESH - W) // use the first D acks for 368 exp.opening 369 N = N - D // and the remaining ones for 370 linear opening 371 T = T + 2*D 372 W = W + D 373 endif 374 // do linear window opening with the remaining acks 375 T = T + N * ( 1 + 1/W ) 376 W = W + N/W 378 o PACKET TRANSMISSION 379 One token is consumed for each packet transmitted: 381 T = T - 1 383 o ACK RECEPTION, LOSS DETECTED 384 If the incoming ACK reports an unacknowledged data packet which is 385 followed by at least 3 acknowledged data packets, then the packet is 386 assumed to be lost and PGMCC reacts by halving the window, in the 387 same way as TCP after 3 duplicate acknowledgements. This is 388 achieved by modifying T and W as follows: 390 T = T - W/2 , W = W/2 392 to simulate the multiplicative decrease. 393 Additionally, all window manipulation is suspended for the 394 subsequent RTT. This is achieved by recording the current transmit 395 sequence number, and canceling any further manipulation of the 396 window until feedback is received for the next transmitted packet, 397 or until a timeout occurs. 399 2.3. Acker Selection 401 The ACKER selection process in PGMCC aims at locating the receiver which 402 would have the lowest throughput if each receiver were using a separate 403 TCP connection to transfer data. 405 Because the steady-state throughput of a TCP connection can be 406 characterized in a reasonably accurate way in terms of its loss rate and 407 round trip time [3], the throughput for each receiver can be estimated 408 by using these two parameters. 410 Whenever an ACK or NAK packet from any of the receivers reaches it, the 411 sender is able to compute the expected throughput T_i for that receiver 412 by using the equation shown in Section 2.4, with the round trip time RTT 413 and loss rate p and measured as described in Sections 2.5 and 2.6, 414 respectively. At any given time, the sender stores the expected 415 throughput for the current ACKER, T_acker. This value is updated every 416 time an ACK or NAK from the current ACKER is received (note that, after 417 a new ACKER is selected, the sender will typically receive ACKs from the 418 old ACKER for one RTT, and the feedback from different ACKERs might be 419 interleaved if the paths leading to them have different round trip 420 times). 422 Whenever an ACK or NAK is received from another node i (a previous ACKER 423 or some other receiver), the expected throughput T_i for that node is 424 computed, and compared with T_acker. Node i is selected as the new 425 acker if 427 T_i < C * T_acker 429 where the constant C between 0 and 1 provides some hysteresis and avoids 430 too frequent oscillations in the choice of the ACKER. A suggested value 431 for C is 0.75. 433 Note that, from an implementation point of view (see Section 2.4), it is 434 more convenient to compute T_i ^(-2), so the above equation must be 435 modified accordingly. 437 2.3.1. Initial Acker election 439 Upon reception of a data packet reporting that no acker is currently 440 selected, receivers generate a dummy NAK report which is used to elect 441 the initial acker. The NAK is sent with the usual feedback suppression 442 mechanism dictated by the transport protocol (possibly with shorter time 443 constants) to avoid feedback implosion, and the sender will select the 444 source of the first incoming NAK as the new ACKER. 446 2.3.2. Acker dropouts 448 If the ACKER decides to disconnect from the session, it can cause the 449 session to stop. To avoid this, it is recommended that an ACKER deciding 450 to leave the session informs the sender by sending an ACK packet (or a 451 duplicate) carrying an "ACKER_LEAVING" option. The reception of this 452 packet by the sender will in turn trigger an initial acker election 453 phase. 455 2.4. TCP Throughput Equation 457 Any realistic equation of TCP throughput as a function of loss event 458 rate and RTT should be suitable for use in PGMCC. Unlike other schemes 459 [2] where the throughput equation directly controls the transmit rate, 460 in PGMCC the equation is used only for acker selection purposes, and the 461 throughput values are only compared among themselves. As a consequence, 462 we can use the following equation, derived from the one presented in [3] 463 by setting RTO = 4 * RTT (as it is common practice): 465 M = 1/T = RTT_i * sqrt(p) * (1 + 9*p * (1 + 32*(p)^2)) 467 where 469 M = 1/T is proportional to the inverse of the throughput for the 470 receiver under consideration; 472 RTT is the round trip time for the receiver under consideration; 474 p is the loss rate for the receiver under consideration, between 0 475 and 1.0; 477 and multiplying constants are omitted. 479 The above equation is accurate on a wide range of loss rates, and also 480 covers situations where retransmission timeouts have a significant 481 impact on the throughput of the protocol. 483 Note that when p=0, the equation yields 1/T = M = 0. This does not 484 constitute a problem as we can still compare the M values computed for 485 different receivers to determine the acker. Also note that it is easier 486 to compute M^2 instead of M, because the former does not require the use 487 of sqrt(). 489 In future, different throughput equations may be substituted for this 490 equation. The requirement is that the throughput equation be a 491 reasonable approximation of the sending rate of TCP for conformant TCP 492 congestion control. 494 The parameters p and RTT need to be measured or calculated by a PGMCC 495 implementation. The measurement of RTT is specified in Section 2.5; the 496 measurement of p is specified in Section 2.6. 498 2.5. RTT measurement 500 In PGMCC, the RTT is measured by the sender making use of the timestamp 501 (or equivalent information) echoed back by each receiver in feedback 502 messages. Three procedures are possible to measure the RTT, as follows. 503 In no case is it required to have clock synchronization between sender 504 and receivers. 506 2.5.1. Explicit Timestamp 508 This first technique relies on the transmission of a timestamp TS_j with 509 each data packet j. 510 The receiver will record the most recently received timestamp, and will 511 echo it back to the source when generating an ACK or a NAK. If the 512 feedback is delayed, the time elapsed between the reception of the 513 timestamp and the generation of the feedback should be added to the 514 echoed timestamp. 515 The sender computes the RTT by subtracting the received timestamp from 516 the current value of the clock. 518 The resolution of the timestamp value should be good enough for 519 reasonable precision measurement of typical network round trip times. If 520 receivers need to apply correction for delayed feedback, it is necessary 521 that receivers know the resolution of the timestamp clock. A suggested 522 value is 1ms. 524 2.5.2. Implicit timestamp 526 With this technique, the sender will record a timestamp TS_j for each 527 transmitted data packet j, but the timestamp will not be transmitted 528 with the packet itself. 529 The receiver will record the most recently received sequence number, and 530 will echo it back to the source when generating an ACK or a NAK. 531 The sender computes the RTT by looking up the timestamp associated with 532 the sequence number received in the feedback packet, and subtracting it 533 from the current clock value. 535 If the feedback from the receiver is delayed, as it is commonly the case 536 for NAKs, the receiver can compute, and send back to the source, a 537 correction term corresponding to the time elapsed between the reception 538 of the timestamp and the generation of the feedback. The correction term 539 will then be subtracted by the sender in order to obtain the correct 540 estimate of the RTT. 542 This RTT measurement technique is equivalent to the previous one, but it 543 saves some space in data packets as the timestamp does not need to be 544 sent explicitly. Feedback packets might become larger if the correction 545 value is transmitted explicitly; but in many cases, the sequence number 546 will already be present for other reasons (e.g. ACK packets), and 547 wherever space is a concern the sequence number and the correction term 548 can be packed in a single 32-bit word without loss of precision. 550 2.5.3. Sequence numbers 552 This technique is the least precise, but it does not rely on the 553 presence of a high resolution clock on the nodes. 554 The sender will not compute any timestamp, and just send data packets 555 with their sequence number j. 556 The receiver will record the most recently received sequence number, and 557 will echo it back to the source when generating an ACK or a NAK. 558 The sender computes the RTT as the difference between the most recently 559 sent sequence number and the sequence number received from the ACK or 560 NAK packet. 562 Note that in this case the RTT is not measured in seconds, but in 563 "sequence numbers", which are monotonically, but not uniformly, 564 increasing with time. The two measurements are equivalent if the sender 565 transmits at a constant rate. When the data rate changes over time (as 566 it is normally happens, given that PGMCC controls the actual data rate), 567 then the "measured" RTT values grow with the actual transmit rate. This 568 can influence the correctness of the results when comparing two 569 measurement done over different and only partially overlapping time (and 570 sequence number) intervals where the transmit rate incurs a significant 571 change. 573 2.5.4. Recommendations 575 Whenever possible, the measurement of the RTT should be carried out 576 using either explicit or implicit timestamps, and by keeping track of 577 the "correction term" (the delay between data reception and feedback 578 generation). 580 If the receiver does not have a clock with a suitable resolution, the 581 correction term might not be present (or be inaccurate). In this case 582 the timestamps received by the sender on NAK packets might be in error, 583 in the worst case, by as much as the packet interarrival time. This 584 error will normally not be present on ACK packets, which are sent 585 immediately. A suitable correction should be applied by the sender in 586 order to avoid systematic errors. 588 The measurement based on sequence numbers is less accurate, but also 589 less sensitive to errors due to the lack of the correction term. In 590 fact, the measurement error induced by the lack of the correction term 591 can be at most one unit. This suggests that, when the correction term 592 is not available, measurements based on sequence numbers should be 593 favoured. Simulations have shown that the acker selection mechanism 594 performs moderately better when the RTT measurement is based on 595 timestamps, but performance is reasonably good also with measurements 596 based on sequence numbers. 598 2.6. Loss rate measurement 600 The loss measurement in PGMCC is entirely performed by receivers. The 601 measurement results do not directly influence the transmit rate, but are 602 only used for comparison purposes. As a consequence, the scheme is 603 reasonably robust to different measurement techniques, as long as they 604 are not influenced too strongly by single loss events. 606 The main method suggested for loss measurement is Exponentially Weighted 607 Moving Average (EWMA), which is formally equivalent to a single-pole 608 digital low pass filter applied to a binary signal x_i, where x_i = 1 if 609 packet i is lost, x_i = 0 if packet i is successfully received. 611 The loss rate p_i upon reception or detection of loss of packet i is 612 computed as 614 p_i = c_p * p_{i-1} + (1 - c_p ) * p_i 615 where the constant c_p between 0 and 1 is related to the bandpass of 616 the filter. Experiments have shown good performance with c = 500/65536, 617 and computations performed with fixed point arithmetic and 16 fractional 618 digits. 620 As an alternative to EWMA, the technique used in TFRC [2] can be 621 adopted. Simulations have shown a moderate improvement in the acker 622 selection mechanism by measuring loss using the TFRC loss estimator, 623 which is however slightly more expensive to compute than the EWMA loss 624 estimator in presence of packet reordering. 626 2.7. Timeouts 628 When a packet is transmitted, the sender schedules a timeout to prevent 629 stalls upon loss of ACKs or disconnection of the ACKER. In TCP, which 630 has a similar problem, the timeout value is computed by accumulating 631 statistics (SRTT and RTTVAR) on RTT samples, starting from a default 632 initial value (3s) when no RTT samples are available. 634 PGMCC can use a similar scheme to compute the timeouts, remembering that 635 upon ACKER changes (which may be very frequent), the computation of SRTT 636 and RTTVAR must be restarted from the beginning, unless the sender 637 decides to keep state for at least a small number of recent ackers. 639 Because the ACKER can leave the group without notifying the sender, 640 after a number of successive timeouts the sender MUST force the election 641 of a new ACKER. We recommend this new election to be performed after 642 two successive timeouts. 644 2.8. Interaction with feedback suppression schemes 646 Several schemes are used by NAK-based multicast protocols to reduce the 647 amount of feedback directed toward the source and make the protocol 648 scale with large populations of receivers. Such schemes typically rely 649 on randomly delaying NAK generation, and suppressing pending NAKs when 650 an equivalent NAK or a retransmission is heard; or, intermediate nodes 651 such as routers can implement some form of feedback aggregation and 652 filtering. 654 Such schemes might prevent NAKs from potential ACKER candidates from 655 reaching the source. This filtering might impact the speed at which 656 PGMCC selects the correct ACKER, though initial experience from 657 simulations seem to suggest that PGMCC behavior is not severely affected 658 by NAK suppression schemes. 660 2.9. Interaction with ECN 662 PGMCC can use ECN notifications in much the same way as actual losses, 663 and use such notifications to control the throughput of the session. 665 At the receiver, ECN-marked data packets can be considered as lost 666 packets for the purpose of loss rate computation and ACK/NAK generation. 667 If the ACKER sends an ACK for ECN-marked packets, that ACK MUST report 668 that the packet being acknowledged that was ECN marked. Similarly the 669 ACKER must indicate in the ACK packet's received packets bitmap that the 670 packet was ECN-marked, or that the packet was lost. 672 We note that to support use of the ECN nonce, the ACK packet's received 673 packets bitmap would require two bits per packet being reported. 675 3. Procedures - Sender 677 The following pseudo-code specifies the complete behavior of the sender 678 in PGMCC. 680 initialization: 681 T = 1 ; W = 1 ; /* initialize window and number of tokens */ 682 RETRY = 0 ; /* number of consecutive timeouts so far */ 683 < initialize p, RTT for acker to default values > 684 ACKER = NO_ACKER; /* no acker is known */ 685 < initialize sequence numbers > 686 QUEUED = 0; /* packets waiting to be transmitted */ 688 on transmission request: 689 send_packet() ; 691 on timeout expiration : 692 T = 1 ; W = 1 ; /* initialize window and number of tokens */ 693 if (RETRY < RETRY_MAX) 694 RETRY = RETRY + 1 695 else 696 ACKER = NO_ACKER ; /* old acker is not valid anymore */ 697 send_packet() ; 699 on ACK/NAK reception from receiver I : 700 < compute p and RTT for source of this ACK, see Sec. 2.5 and 2.6 > 701 RETRY = 0 ; 702 if (ACKER == NO_ACKER) { /* select current as acker is no other known */ 703 ACKER = I ; 704 T = T + 1 ; 705 } 706 if (ACKER != I) 707 < select acker according to Sec. 2.3 > ; 708 else { 709 710 if (packet_type == ACK) { 711 < update_window according to Sec.2.2 > 712 send_packet ; 713 if (ack_pending) 714 update_timeout ; 715 } 716 } 718 send_packet() { 719 if (QUEUED > 0 && T >= 1) { 720 < transmit one packet > 721 T = T - 1 ; 722 QUEUED = QUEUED - 1 ; 723 } 724 if ( ) 725 726 } 728 4. Procedures -- Receiver 730 The following pseudo-code specifies the complete behavior of the 731 receiver in PGMCC. 733 A receiver only transmits an ACK packet when it receives a data packet 734 for which the receiver is designated as the ACKER by the data packet 735 itself. A receiver can transmit a NAK packet after it has detected that 736 a data packet is missing and a suitable delay has passed, as dictated by 737 the feedback suppression rules of the protocol in use. 739 The data packet contains acknowledgement status about the most recent 32 740 sequence numbers known to the receiver. 742 on initialization/session setup: 743 < initialize state variables and ACK bitmap > 745 on DATA packet reception: 746 < update p measurement according to Sec.2.6 > 747 < record timestamp and packet reception time > 748 if (ACKER == this_node) { 749 < send an immediate ACK > 750 } 751 if ( ) 752 < schedule a timeout for NAK transmission > 754 on NAK reception: 755 < suppress any pending NAK transmission for the sequence 756 number indicated in the NAK > 758 on timeout: 759 if ( < there are missing and unacknowledged packets > ) { 760 < send a NAK for one or more of the missing packets > 761 < mark such packets as acknowledged > 762 if ( ) 763 < schedule a timeout for NAK transmission > 764 } 766 5. Security Considerations 768 PGMCC is not a transport protocol in its own right, but a congestion 769 control mechanism that is intended to be used in conjunction with a 770 transport protocol. Therefore security primarily needs to be considered 771 in the context of a specific transport protocol and its authentication 772 mechanisms. 774 Congestion control mechanisms can potentially be exploited to create 775 denial of service. This may occur through spoofed feedback. Thus any 776 transport protocol that uses PGMCC should take care to ensure that 777 feedback is only accepted from the receiver of the data. The precise 778 mechanism to achieve this will however depend on the transport protocol 779 itself. 781 In addition, congestion control mechanisms may potentially be 782 manipulated by a greedy receiver that wishes to receive more than its 783 fair share of network bandwidth. A receiver might do this by first 784 reporting inflated loss and RTT samples, in order to get selected as the 785 ACKER, and then generating ACK at the desired rate (including possibly 786 claiming to have received packets that in fact were lost due to 787 congestion). Possible defenses against such a receiver could be based 788 on the sender verifying the correctness of the loss and RTT samples 789 supplied by the receiver. A PGMCC sender SHOULD compare the receiver 790 reports on loss rate and RTT with the information derived directly from 791 the incoming stream of ACKs. In case of discrepancy of the reports, a 792 PGMCC sender SHOULD mark the current acker as ineligible and initiate a 793 new acker election. The decision on how large that discrepancy should be 794 before initiating a new acker election is left to the implementation. 796 Also, the sender MAY include some form of nonce that the receiver must 797 feed back to the sender to prove receipt. However, the details of such a 798 nonce would depend on the transport protocol, and in particular on 799 whether the transport protocol is reliable or unreliable. 801 6. Authors' Addresses 803 Luigi Rizzo 804 luigi@iet.unipi.it 805 Dip. Ing. dell'Informazione, 806 Univ. di Pisa 807 via Diotisalvi 2, 56122 Pisa, Italy 809 Gianluca Iannaccone 810 gianluca.iannaccone@intel.com 811 Intel Research 812 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK 814 Lorenzo Vicisano 815 lorenzo@cisco.com 816 cisco Systems, Inc. 817 170 West Tasman Dr., 818 San Jose, CA, USA, 95134 820 Mark Handley 821 m.handley@cs.ucl.ac.uk 822 University College London, 823 Gower Street, London WC1E 6BT, UK 825 7. Acknowledgments 827 We would like to acknowledge feedback and discussions on equation-based 828 congestion control with a wide range of people, including members of the 829 Reliable Multicast Research Group, the Reliable Multicast Transport 830 Working Group, and the End-to-End Research Group. 832 [1] Bradner, S., Key words for use in RFCs to Indicate Requirement 833 Levels (IETF RFC 2119) http://www.rfc-editor.org/rfc/rfc2119.txt 835 [2] Floyd, S., Handley, M., Padhye, J., Widmer, J., "Equation-Based 836 Congestion Control for Unicast Applications", ACM SIGCOMM 2000, 837 Stockholm, Aug. 2000 839 [3] Padhye, J. and Firoiu, V. and Towsley, D. and Kurose, J., "Modeling 840 TCP Throughput: A Simple Model and its Empirical Validation", Proc ACM 841 SIGCOMM 1998. 843 [4] Mankin, A., Romanow, A., Brander, S., Paxson, V., "IETF Criteria for 844 Evaluating Reliable Multicast Transport and Application Protocols," 845 RFC2357, June 1998. 847 [5] Rizzo, L., "pgmcc: a TCP-friendly single-rate multicast congestion 848 control scheme", ACM SIGCOMM 2000, Stockholm, Aug.2000 850 [6] Schulzrinne, H., Casner, S., Frederick, R., Jacobson, V., "RTP: A 851 Transport Protocol for Real-Time Applications", RFC 1889, Jan 1996. 853 [7] Speakman, T., Crowcroft, J., Gemmell, J., Farinacci, D. , Lin, S., 854 Leshchiner, D., Luby, M., Montgomery, T. , Rizzo, L., Tweedly, A., 855 Bhaskar, N., Edmonstone, R., Sumanasekera, R., Vicisano, L., PGM 856 Reliable Transport Protocol Specification, RFC 3208, December 2001. 857 rfc3208.txt also available at ftp://ftp.rfc-editor.org/in- 858 notes/rfc3208.txt 860 8. Full Copyright Statement 862 Copyright (C) The Internet Society (2000). All Rights Reserved. 864 This document and translations of it may be copied and furnished to 865 others, and derivative works that comment on or otherwise explain it or 866 assist in its implementation may be prepared, copied, published and 867 distributed, in whole or in part, without restriction of any kind, 868 provided that the above copyright notice and this paragraph are included 869 on all such copies and derivative works. However, this document itself 870 may not be modified in any way, such as by removing the copyright notice 871 or references to the Internet Society or other Internet organizations, 872 except as needed for the purpose of developing Internet standards in 873 which case the procedures for copyrights defined in the Internet 874 languages other than English. 876 The limited permissions granted above are perpetual and will not be 877 revoked by the Internet Society or its successors or assigns. 879 This document and the information contained herein is provided on an "AS 880 IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK 881 FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT 882 LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT 883 INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR 884 FITNESS FOR A PARTICULAR PURPOSE."