Internet Engineering Task Force                                A. Agache
Internet-Draft                                                 C. Raiciu
Intended status: Experimental        University Politehnica of Bucharest
Expires: January 21, 2016                                  July 20, 2015


                       TCP Sendbuffer Advertising
                     draft-agache-tcpm-sndbufadv-00

Abstract

   Network operators have difficulty in understanding the end-to-end
   performance of TCP connections through their networks.
   By observing packets at different vantage points on their path and
   maintaining per-flow state, network operators can detect packet
   losses and retransmissions and estimate RTTs, among other metrics.
   A key piece of information needed by networks is whether a
   connection is limited by the network or by the application.  This
   information is very difficult to infer accurately from passive
   measurements.

   We propose to advertise sendbuffer occupancy in TCP: each segment
   will carry the amount of backlogged data present in the sender's
   buffer.  This information allows networks to discern between
   application-limited, network-limited and flow-control-limited flows,
   creating new avenues of network optimization.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 21, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Requirements Language
   2.  Introduction
   3.  TCP Sendbuffer Structure
   4.  Negotiating sendbuffer advertising
   5.  Encoding sendbuffer information
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Introduction

   Aggregate link statistics, such as packet and loss counts, are easily
   available in modern networks, but they convey a fairly limited
   picture of network performance.  In many cases, the network needs
   information about individual flows' demand for bandwidth to make
   appropriate resource allocation decisions.

   One example is a mobile phone streaming audio or video over a WiFi
   connection.  The default strategy is to always stick to WiFi when
   available, even though performance may be terrible and may seriously
   impair the user experience.  If the mobile network knew that the
   multimedia stream needed more bandwidth, it could fire up the
   cellular connection and migrate traffic to it by using mobile client
   offloading software relying on Multipath TCP [NSDI-12] or Mobile IP
   [RFC5944].

   Another example is in datacenters with Clos topologies (such as the
   popular FatTree topology [FatTree]), where elephant flows are
   randomly placed on paths with flow-level Equal Cost Multipath
   Routing; when two or more elephant flows are placed on the same link,
   performance degrades despite existing capacity elsewhere in the
   network.  The network can reroute such flows by using tunnels or
   programmable switches (e.g., OpenFlow), but the one thing missing is
   the information regarding which flows could utilize more capacity if
   given a better path.

   Determining whether a TCP connection is network-limited is difficult
   to do by passive monitoring.  The network needs to keep per-flow
   state, to estimate the sender congestion window and to accurately
   monitor flight-size.  When flight-size is smaller than both the
   congestion window and the receive window, the connection is limited
   by the application and does not need more capacity.

   We propose that each TCP segment should also encode the amount of
   backlogged data in the TCP sendbuffer.  This information enables
   network boxes and receivers to easily identify connections that need
   more capacity.  Our goal is to have this extension "always on", and
   it is therefore very important to reduce its overhead.  Next, we
   discuss how to compute and report the amount of backlogged data.  We
   follow with a discussion of signaling options for conveying
   sendbuffer information.

3.  TCP Sendbuffer Structure

                     1          2
              ---|----------|----------|--->
              SND.UNA    SND.NXT    WRITE.SEQ

         1 - sequence numbers of unacknowledged, in-flight data
         2 - sequence numbers of backlogged data

                     Anatomy of the TCP Sendbuffer

   The figure above shows the anatomy of the TCP sendbuffer.  SND.UNA
   represents the oldest sequence number sent but not yet acknowledged.
   At the other end there is WRITE.SEQ, the tail sequence number of data
   held in the sendbuffer.
   Somewhere in between we have SND.NXT, the sequence number of the
   next byte to be sent.  From SND.NXT to WRITE.SEQ we have backlogged
   data, written by the application but not yet transmitted.

   SND.NXT is constrained by both the receive window and the congestion
   window as follows:

      SND.NXT <= SND.UNA + min(SND.WND, SND.CWND)

   As long as the receive window is not a bottleneck, and in the absence
   of hardware issues or software bugs, having SND.NXT smaller than
   WRITE.SEQ indicates that the congestion window is not large enough,
   so the connection is network-limited at that point in time.  The
   easiest way to implement sendbuffer advertising is to simply copy the
   amount of backlogged data (WRITE.SEQ - SND.NXT) into the segment when
   it leaves the TCP stack.  However, this will result in a non-zero
   sendbuffer advertisement when the connection is application-limited
   but the application writes bursts of a few packets.  These packets
   will be sent out immediately on the wire, yet the first packets in
   the burst will report that the application is backlogged, when in
   fact it is not.

   To correctly implement sendbuffer advertisement, the sender MUST
   advertise the amount of backlogged data according to the formula
   below:

      SEG.SNDBUF = WRITE.SEQ - SND.UNA - min(SND.WND, SND.CWND),
                   if WRITE.SEQ > SND.UNA + min(SND.WND, SND.CWND)

      SEG.SNDBUF = 0, otherwise

   This formula ensures that if an application write fits in the current
   receive and congestion windows, all the resulting segments will
   advertise zero backlogged data.

4.  Negotiating sendbuffer advertising

   The standard way to extend TCP is to negotiate the extension during
   the three-way handshake.  The TCP option space, however, is already
   very crowded in the SYN exchange.  Until solutions that extend the
   TCP option space are standardized, negotiation in the SYN exchange
   is, in our view, not a feasible option for sendbuffer advertising.
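
   As a reference point for the sections that follow, the SEG.SNDBUF
   rule from Section 3 can be written out as a minimal Python sketch.
   This is an illustration only, not part of the proposal.  The
   function names seg_sndbuf and naive_backlog are hypothetical, and
   the sketch assumes plain non-wrapping integers for sequence numbers,
   whereas a real stack would use modulo 2^32 arithmetic.

```python
def seg_sndbuf(write_seq: int, snd_una: int, snd_wnd: int, snd_cwnd: int) -> int:
    """Backlog to advertise in a segment, following the Section 3 formula.

    Assumption: sequence numbers are plain integers that do not wrap.
    """
    window = min(snd_wnd, snd_cwnd)          # effective send window
    if write_seq > snd_una + window:
        # Data beyond what the windows allow is genuinely backlogged.
        return write_seq - snd_una - window
    # The whole write fits in the current windows: report no backlog.
    return 0


def naive_backlog(write_seq: int, snd_nxt: int) -> int:
    """The simpler WRITE.SEQ - SND.NXT value that Section 3 argues against."""
    return write_seq - snd_nxt


if __name__ == "__main__":
    # Application-limited burst: 4000 bytes written, windows allow 14600.
    print(naive_backlog(write_seq=5000, snd_nxt=1000))          # 4000
    print(seg_sndbuf(5000, 1000, snd_wnd=65535, snd_cwnd=14600))  # 0
    # Network-limited: 20000 bytes written, congestion window of 14600.
    print(seg_sndbuf(21000, 1000, snd_wnd=65535, snd_cwnd=14600))  # 5400
```

   In the first case the naive value reports 4000 bytes of backlog on
   the first segment of the burst even though the whole write can be
   sent immediately, while the Section 3 formula reports zero.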

   Fortunately, sendbuffer advertising is a sender-side only
   modification to TCP, and the information it makes available can be
   used by anyone that understands it, be it the network or the
   receiver.  This implies that we can simply bypass the three-way
   handshake as long as the actual encoding of the sendbuffer
   information in TCP segments does not have negative effects on legacy
   routers, middleboxes and TCP receivers.  We discuss encoding in the
   next section.

   TCP sendbuffer advertising will therefore be a simple sender-only
   enhancement to the TCP stack that can be enabled by using system-wide
   configuration (e.g., sysctl in Linux).

5.  Encoding sendbuffer information

   In this section we discuss two encoding alternatives for sendbuffer
   information: as a new TCP option, or in the acknowledgement and
   receive window fields of data segments.

   The first solution is to simply encode sendbuffer information in a
   new TCP option on every segment carrying data in a TCP connection,
   without negotiating this extension in the three-way handshake.  This
   adds only 6 bytes of overhead to each TCP segment.  This option is
   feasible only when there is sufficient space in the TCP option field
   of the corresponding data segment.

   Avoiding the option negotiation works well in datacenters, where it
   can be ensured out-of-band that all machines either understand
   sendbuffer advertising or are unaffected by segments carrying new
   options.  In the Internet, before advertising sendbuffer information
   in new TCP options we need to ensure that: a) existing TCP stacks are
   robust to unknown options, simply ignoring them, and b) middleboxes
   do not drop segments carrying unknown options.  Existing studies
   [IMC-11] imply that the vast majority of network paths either allow
   unknown options or drop the options, allowing the segments through.
   Only a very small fraction of paths drop the segments with unknown
   options.  To cope with such cases, the implementation MUST NOT
   include sendbuffer information on retransmitted packets, to ensure
   that the connection makes some progress even in the presence of such
   middleboxes.

   Our second solution is based on the observation that while TCP itself
   is bidirectional, in practice most connections transfer data in only
   one direction at a time.  The endpoints can be either data senders
   or receivers at different moments, but they rarely act as both at
   the same time.  When traffic is unidirectional, the sender sends the
   same value for the acknowledgement number and receive window field
   over and over again.

   We propose to reuse one or both of these fields to advertise
   sendbuffer information instead when traffic is unidirectional.  To
   detect unidirectional traffic, the sender will maintain a state
   variable called SND.NUM_SEG that is initially set to zero, and is
   zeroed whenever a segment with a valid ACK field is sent out.
   SND.NUM_SEG will be incremented whenever a segment is received.  A
   sendbuffer advertisement SHOULD be encoded in outgoing segments only
   when SND.NUM_SEG = 0, that is, when no segment has been received
   since the last valid acknowledgement was sent.

   Sendbuffer advertising will encode the proper value in the ACK field
   and leave the ACK flag unset.  This ensures the receiver and other
   on-path hosts will ignore the field altogether.  We still need,
   however, to inform parties interested in sendbuffer information that
   they can use the value of the ACK field.

   In datacenters, we can simply define one of the reserved TCP flags as
   the sendbuffer advertisement flag.  When this flag is set, the
   sendbuffer value is encoded in the ACK field.  The sendbuffer
   advertisement flag and the ACK flag MUST NOT be set simultaneously.
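
   The sender-side bookkeeping described above can be sketched as
   follows.  This is a minimal illustration under stated assumptions,
   not a definitive implementation: the names AckFieldAdvertiser,
   SegmentFields and build_outgoing are hypothetical, the reserved-flag
   variant for datacenters is assumed, and the advertised value is
   assumed to fit in the 32-bit ACK field.

```python
from dataclasses import dataclass
from typing import NamedTuple


class SegmentFields(NamedTuple):
    """Header fields relevant to the sketch (hypothetical container)."""
    ack_flag: bool    # standard TCP ACK flag
    adv_flag: bool    # assumed reserved flag marking a sendbuffer advertisement
    ack_field: int    # 32-bit acknowledgement number field


@dataclass
class AckFieldAdvertiser:
    num_seg: int = 0  # SND.NUM_SEG: segments received since the last ACK sent

    def on_segment_received(self) -> None:
        # SND.NUM_SEG is incremented whenever a segment is received.
        self.num_seg += 1

    def build_outgoing(self, rcv_nxt: int, sndbuf: int) -> SegmentFields:
        if self.num_seg == 0:
            # Nothing arrived since the last valid ACK went out, so the ACK
            # field would only repeat itself: carry SEG.SNDBUF instead and
            # leave the ACK flag unset.
            return SegmentFields(ack_flag=False, adv_flag=True, ack_field=sndbuf)
        # Otherwise send a regular acknowledgement and zero SND.NUM_SEG.
        self.num_seg = 0
        return SegmentFields(ack_flag=True, adv_flag=False, ack_field=rcv_nxt)


if __name__ == "__main__":
    sender = AckFieldAdvertiser()
    print(sender.build_outgoing(rcv_nxt=500, sndbuf=5400))  # advertisement
    sender.on_segment_received()
    print(sender.build_outgoing(rcv_nxt=700, sndbuf=5400))  # regular ACK
    print(sender.build_outgoing(rcv_nxt=700, sndbuf=3940))  # advertisement
```

   Because each outgoing segment takes exactly one of the two branches,
   the advertisement flag and the ACK flag are never set on the same
   segment.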

   In the Internet, redefining the meaning of one of the reserved flags
   will simply not work through existing middleboxes; additionally,
   certain middleboxes may zero the ACK field when the ACK flag is not
   set.  In this context, we propose to use the receive window field in
   segments carrying sendbuffer information to encode a checksum of this
   information.  Interested parties will: a) scan for data segments with
   the ACK flag not set, and b) compute a 1's complement checksum of the
   ACK field and check it against the receive window field.  In case of
   a match, the sendbuffer information can be used.  To understand the
   feasibility of this encoding, however, tests must be conducted to
   check the behaviour of middleboxes when the ACK flag is not set.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

6.2.  Informative References

   [FatTree]  Al-Fares, M., Loukissas, A., and A. Vahdat, "A scalable,
              commodity data center network architecture", 2008.

   [IMC-11]   Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A.,
              Handley, M., and H. Tokuda, "Is it still possible to
              extend tcp?", 2011.

   [NSDI-12]  Raiciu, C., Paasch, C., Barre, S., Ford, A., Honda, M.,
              Duchene, F., Bonaventure, O., and M. Handley, "How hard
              can it be? designing and implementing a deployable
              multipath tcp", 2012.

   [RFC5944]  Perkins, C., "IP Mobility Support for IPv4, Revised",
              RFC 5944, November 2010.

Authors' Addresses

   Alexandru Agache
   University Politehnica of Bucharest
   Splaiul Independentei 313
   Bucharest
   Romania

   Email: alexandru.agache@cs.pub.ro


   Costin Raiciu
   University Politehnica of Bucharest
   Splaiul Independentei 313
   Bucharest
   Romania

   Email: costin.raiciu@cs.pub.ro