TCP Maintenance and Minor Extensions                    R. Scheffenegger
(tcpm)                                                      NetApp, Inc.
Internet-Draft                                             M. Kuehlewind
Updates: 1323 (if approved)                      University of Stuttgart
Intended status: Experimental                              July 16, 2012
Expires: January 17, 2013


        Additional negotiation in the TCP Timestamp Option field
                        during the TCP handshake
           draft-scheffenegger-tcpm-timestamp-negotiation-04

Abstract

   A number of TCP enhancements in fields as diverse as congestion
   control, loss recovery or side-band signaling could be improved by
   allowing both ends of a TCP session to interpret the value carried in
   the Timestamp option.  Further enhancements are enabled by changing
   the receiver side processing of timestamps in the presence of
   Selective Acknowledgements.

   This document updates RFC1323 and specifies a backwards compatible
   way of negotiating Timestamp capabilities, and lists a number of
   benefits and drawbacks of this approach.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 17, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Overview of the TCP Timestamp Option
   4.  Extended Timestamp Capabilities
     4.1.  Description
     4.2.  Timestamp clock interval exposure
     4.3.  Timestamp echo update for Selective Acknowledgments
   5.  Timestamp capability signaling and negotiation
     5.1.  Capability Flags
     5.2.  Timestamp clock interval encoding
     5.3.  Negotiation error detection and recovery
     5.4.  Interaction with Selective Acknowledgment
       5.4.1.  Interaction with the Retransmission Timer
       5.4.2.  Interaction with the PAWS test
     5.5.  Discussion
   6.  Acknowledgements
   7.  Updates to Existing RFCs
   8.  IANA Considerations
   9.  Security Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  Pseudo Code examples
     A.1.  Sender
     A.2.  Receiver
   Appendix B.  Possible use cases
     B.1.  One-way delay variation measurement
     B.2.  Timestamp clock rate exposure
     B.3.  Early spurious retransmit detection
     B.4.  Early lost retransmission detection
     B.5.  Integrity of the Timestamp value
     B.6.  Disambiguation with slow Timestamp clock
     B.7.  Masked timestamps as segment digest
   Appendix C.  Revision history
   Authors' Addresses

1.  Introduction

   The Timestamp option originally introduced in [RFC1323] was designed
   to support only two very specific mechanisms, round trip time
   measurement (RTTM) and protection against wrapped sequence numbers
   (PAWS), assuming a particular TCP algorithm (Reno).  The current
   semantics inhibit the use of the Timestamp option for other purposes.
   Taking advantage of developments since TCP Reno, in particular
   Selective Acknowledgements (SACK) [RFC2018], allows different
   semantics, which in turn enable new uses for the Timestamp option,
   either for timing purposes (e.g. one-way delay variation measurement
   in the context of congestion control), or as a unique token (e.g. for
   improved loss recovery).

   This specification defines a protocol for the two ends of a TCP
   session to negotiate alternative semantics of the Timestamp option
   fields they will exchange during the rest of the session.  It updates
   RFC1323, but it is backwards compatible with implementations of
   RFC1323 Timestamp options and allows gradual deployment.

   The RFC1323 timestamp protocol presents the following problems when
   trying to extend it for alternative uses:

   a.  Unclear meaning of the value in a timestamp.

       *  A timestamp value (TSval) as defined in [RFC1323] is
          deliberately only meaningful to the end that sends it.  The
          other end is merely meant to echo the value without
          understanding it.  This is fine if one end is trying to
          measure two-way delay (round trip time).  However, to measure
          one-way delay variation, timestamps from both ends need to be
          compared by one end, which needs to relate the values in
          timestamps from both ends to a notion of the passage of time
          that both ends share.

   b.  No control over which timestamp to echo.

       *  A host implementing [RFC1323] is meant to echo the timestamp
          value of the most recent in-order segment received.  This was
          fine for TCP Reno, but it is not the best choice for TCP
          sessions using selective acknowledgement (SACK) [RFC2018].

       *  A [RFC1323] host is meant to echo the timestamp value of the
          earliest unacknowledged segment, e.g. if a host delays ACKs
          for one segment, it echoes the first timestamp, not the
          second.  It is desirable to include delay due to ACK
          withholding when a host is conservatively measuring RTT.
          However, it is not useful to include the delay due to ACK
          withholding when measuring one-way delay variation.

   c.  Alternative protection against wrapped sequence numbers.

       *  [RFC1323] also points out that the timestamps it specifies
          will always strictly monotonically increase in each window, so
          they can be used to protect against wrapped sequence numbers
          (PAWS).  If the endpoints negotiate an alternative timestamp
          scheme in which timestamps may not monotonically increase per
          window, then it needs to be possible to negotiate alternative
          protection against wrapped sequence numbers.

   To solve these problems this specification changes the wire protocol
   of the TCP timestamp option in the following ways:

   1.  It updates [RFC1323] to add the ability to negotiate the
       semantics of timestamp options.  The initiator of a TCP session
       starts the negotiation in the TSecr field in the first <SYN>,
       which is currently unused.  This specification defines the
       semantics of the TSecr field in a segment with the SYN flag set.
       A version number is included to allow further extension of
       capability negotiation in future.

   2.  A version independent ability to mask a specified number of the
       least significant bits of the timestamp values is present.  These
       masked bits are not considered for timestamp calculations, or in
       an algorithm to protect against wrapped sequence numbers.  Future
       extensions can thereby change the timestamp signaling without
       changing the modified treatment on the receiver side.

   3.  It updates [RFC1323] to define version 0 of timestamp
       capabilities to include:

       *  the duration in seconds of a tick of the timestamp clock,
          using a time interval representation

       *  agreement that both ends will echo the timestamp on the most
          recently received segment, rather than the one that would be
          echoed by an [RFC1323] host.  There is no specific option to
          request this behavior; it is implied by successful
          negotiation of both SACK and timestamp capabilities.

   With this new wire protocol, a number of new use-cases for the TCP
   timestamp option become possible.  Appendix B gives some examples.
   Further extensions might be required in future.  Two possible ways to
   extend the negotiation capabilities are mentioned, one maintaining
   some of the semantics specified herein, and an incompatible extension
   to allow for other semantics.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   The reader is expected to be familiar with the definitions given in
   [RFC1323].

   Further terminology used within this document:

   Timestamp clock interval
      The Timestamp value is derived from a clock source running at a
      reasonably constant frequency.  The interval between two ticks of
      that clock is signaled during the timestamp capability
      negotiation.  Note that the timestamp clock is not required to be
      identical with the TCP clock, even though most implementations
      use the same clock for practical purposes.

   Timestamp option
      This refers to the entire TCP timestamp option, including both
      TSval and TSecr fields.

   Timestamp capabilities
      Refers only to the values and bits carried in the TSecr field of
      <SYN> and <SYN,ACK> segments during a TCP handshake.  For
      signaling purposes, the timestamp capabilities are sent in clear
      with the <SYN> segment, and in an encoded form (see Section 5 for
      details) in the <SYN,ACK> segment.

3.  Overview of the TCP Timestamp Option

   The TCP Timestamp option (TSopt) provides timestamp echoing for
   round-trip time (RTT) measurements.  TSopt is widely deployed and
   activated by default in many systems.  [RFC1323] specifies TSopt the
   following way:

   Kind: 8

   Length: 10 bytes

   +-------+-------+---------------------+---------------------+
   |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
   +-------+-------+---------------------+---------------------+
       1       1              4                     4

                        Figure 1: RFC1323 TSopt

      "The Timestamps option carries two four-byte timestamp fields.
      The Timestamp Value field (TSval) contains the current value of
      the timestamp clock of the TCP sending the option.

      The Timestamp Echo Reply field (TSecr) is only valid if the ACK
      bit is set in the TCP header; if it is valid, it echos a timestamp
      value that was sent by the remote TCP in the TSval field of a
      Timestamps option.  When TSecr is not valid, its value must be
      zero.  The TSecr value will generally be from the most recent
      Timestamp option that was received; however, there are exceptions
      that are explained below.

      A TCP may send the Timestamps option (TSopt) in an initial <SYN>
      segment (i.e., segment containing a SYN bit and no ACK bit), and
      may send a TSopt in other segments only if it received a TSopt in
      the initial <SYN> segment for the connection."

   The comparison of the timestamp in the TSecr field to the current
   timestamp clock gives an estimation of the two-way delay (RTT).  With
   [RFC1323] the receiver is not supposed to interpret the TSval field
   for timing purposes, e.g. one-way delay variation measurements, but
   only to echo the content in the TSecr field.  [RFC1323] specifies
   various cases when more than one timestamp is available to echo.  The
   only property exposed to a receiver is a strict monotonic increase in
   value, for use with the protection against wrapped sequence numbers
   (PAWS) test.  The approach taken by [RFC1323] is not always the best
   choice, e.g. when the TCP Selective Acknowledgment option (SACK) is
   used in the same session.

4.  Extended Timestamp Capabilities

4.1.  Description

   Timestamp values are carried in each segment if negotiated for.
   However, the content of these values is to be treated as an immutable
   and largely uninterpreted entity by the receiver.  The timestamp
   negotiation should meet the following criteria:

   o  Allow timing information to be stated explicitly during the
      initial handshake, avoiding the proliferation of ad-hoc heuristics
      to determine this information via some other means.

   o  Indicate the (approximate) timestamp clock interval used by the
      sender over a wide range.  The longest interval should be around
      10 seconds, while the shortest interval should allow unique
      timestamps per segment, even at extremely high link speeds.

   o  Allow for timestamps that are not directly related to real time
      (e.g. segment counting, or use of the timestamp value as a true
      extension of sequence numbers).

   o  Provide means to prevent or at least detect tampering with the
      echoed timestamp value, allowing for basic integrity and
      consistency checks.

   o  Allow for future extensions that may use some of the timestamp
      value bits for other signaling purposes during the remainder of
      the session.

   o  Signaling must be backwards compatible with existing TCP stacks
      implementing basic [RFC1323] timestamps.  Current methods for
      timestamp value generation must be supported.

   o  Allow for a means to disambiguate between retransmitted and
      delayed segments.

   o  Cater for broken implementations of [RFC1323] that either send a
      non-zero TSecr value in the initial <SYN>, or a zero TSecr value
      in <SYN,ACK>.

   o  Provide flexibility to extend the negotiation protocol.
      Both backwards-compatible and incompatible extensions of the use
      of timestamps should be possible.

4.2.  Timestamp clock interval exposure

   The most important new property enabled by negotiating the timestamp
   capabilities is the explicit signaling of the timestamp clock
   interval.  This is enabled by using the (unused) TSecr field in the
   TCP <SYN> segment, and a simple announcement mechanism without
   direct feedback from the partner.  By extension, passive observation
   of the TCP handshake will reveal more TCP session parameters
   explicitly than can currently be deduced implicitly without certain
   assumptions.

   In the returned <SYN,ACK> there is no unused field left, however.  In
   order to remain backwards compatible, a receiver capable of timestamp
   capability negotiation has to XOR the receiver's (local) capability
   flags with the received TSval, before echoing the result back in the
   TSecr field.  During the initial handshake, the sender has to store
   the initially sent TSval, in order to determine if the receiver can
   support this timestamp capability negotiation.

   As the importance of the timestamp option increases by using it in
   more aspects of a TCP sender's operation, e.g. congestion control and
   loss recovery, so does the importance of maintaining the integrity of
   the reflected timestamps.  At the same time this must not prevent the
   receiver from interpreting a received timestamp in TSval.

   This is achieved by indicating how many least significant bits of the
   timestamp value MUST NOT be interpreted by the receiver.  Apart from
   the purpose of maintaining timestamp integrity for use as an input
   signal into congestion control algorithms, this also allows the use
   of timestamp based methods to discriminate at the earliest possible
   moment (within 1 RTT after the retransmission) between spurious
   retransmissions and genuine loss, even when using slow running TCP
   timestamp clocks.

4.3.  Timestamp echo update for Selective Acknowledgments

   In [RFC1323], timing information is only considered in relation to
   calculating a (conservative) estimate of the round trip time, in
   order to arrive at a reasonable retransmission timeout (RTO).  A
   retransmission timeout is a very expensive event in TCP, in terms of
   lost throughput and other metrics.  For this reason, a receiver had
   to follow special rules on which timestamp to echo.  The aim was to
   never underestimate the actual RTT, even during periods of loss or
   reordering on either the forward or return path.  No other explicit
   signal could convey the presence of such events back to the sender at
   the time [RFC1323] was defined.  Therefore a receiver had to make
   sure that, at best, the timestamp of the last in-sequence segment
   would be echoed to the sender.

   Receivers conforming to [RFC1323] are required to only reflect the
   timestamp of the last segment that was received in order, or the
   timestamp of the last not yet acknowledged segment in the case of
   delayed acknowledgments.

   When selective acknowledgment (SACK) is enabled on a session, the
   presence of a SACK option will explicitly signal reordering or loss
   to the sender.  This information can be used to suspend the
   calculation of updated RTT estimates.  As the SACK option will be
   present in multiple ACKs, this also prevents increasing RTT
   artificially when some of the ACKs indicating loss are dropped on
   the return path.
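
   The suspension of RTT updates while SACK is signaling loss or
   reordering can be illustrated with a small sketch.  It feeds
   timestamp-based RTT samples into an estimator in the style of RFC
   6298 and skips any ACK that carries a SACK option.  The class name,
   the ack_has_sack flag and the choice of the RFC 6298 gains are
   assumptions made for this illustration only; this document does not
   prescribe a particular estimator.

```python
class RttEstimator:
    """Smoothed RTT estimator in the style of RFC 6298 (illustrative only)."""

    ALPHA = 1 / 8  # gain for SRTT, as suggested by RFC 6298
    BETA = 1 / 4   # gain for RTTVAR, as suggested by RFC 6298

    def __init__(self):
        self.srtt = None    # smoothed RTT in seconds
        self.rttvar = None  # RTT variation in seconds

    def on_ack(self, now, echoed_send_time, ack_has_sack):
        """Process one ACK and return True if the estimate was updated.

        now and echoed_send_time are local clock readings in seconds;
        echoed_send_time is the send time recovered from TSecr.
        """
        if ack_has_sack:
            # SACK signals loss or reordering: suspend RTT updates.
            return False
        sample = now - echoed_send_time
        if self.srtt is None:
            # First measurement initialises both state variables.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # RTTVAR is updated before SRTT, using the old SRTT.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - sample))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        return True
```

   In this sketch an ACK carrying SACK blocks leaves SRTT and RTTVAR
   untouched, so a burst of duplicate ACKs during loss recovery does not
   inflate the estimate.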

   A receiver supporting the timestamp negotiation mechanism defined in
   this document MUST immediately reflect the value of TSval in the
   segment triggering an ACK, when the same session also supports SACK.

   The rules to update the state variable TS.recent remain identical to
   [RFC1323], and TS.recent must be evaluated when performing the PAWS
   test on the receiver side.

   With this change of semantics when using timestamps and selective
   acknowledgments [RFC2018] in the same session, enhancements in loss
   recovery are possible by removing any remaining retransmission and
   acknowledgment ambiguity.  See Appendix B for a more detailed
   discussion.  Through the modification to the handling of which
   timestamp to echo in the receiver, timestamps fulfill the properties
   of the "token", as described in [I-D.sabatini-tcp-sack].

5.  Timestamp capability signaling and negotiation

   In order to signal the supported capabilities, both the sender and
   the receiver will independently generate a timestamp capability
   negotiation field, as indicated below.  The TSecr value field of the
   [RFC1323] TSopt is overloaded with the following flags and fields
   during the initial <SYN> and <SYN,ACK> segments.  The connection
   initiator will send the timestamp capabilities in plain, as with
   [RFC1323] the TSecr is not used in the initial <SYN>.  The receiver
   will XOR the local timestamp capabilities with the TSval received
   from the sender and send the result in the TSecr field.  The
   initiating host of a session with timestamp capability negotiation
   has to keep minimal state to decode the returned capabilities XOR'ed
   with the sent TSval.

5.1.  Capability Flags

   Kind: 8

   Length: 10 bytes

   +-------+-------+---------------------+---------------------+
   |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
   +-------+-------+---------------------+---------------------+
       1       1              4          |          4          |
                                         /                     |
    .-----------------------------------'                      |
   /                                                             \
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |E|   |         #                                               |
   |X|VER|   MSK   #           version specific contents           |
   |O|   |         #                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  Figure 2: Timestamp Capability flags

   Common fields to all versions:

   EXO - Extended Options (1 bit)
      Indicates that the sender supports extended timestamp
      capabilities as defined by this document, and MUST be set to one
      by a compliant implementation.  This flag also enables the
      immediate echoing of the TSval with the next ACK, if both
      timestamp capabilities and selective acknowledgement [RFC2018]
      are successfully negotiated during the initial handshake (see
      Section 4.3 and Section 5.4).  This change in semantics is
      independent of the version in the signaled timestamp
      capabilities.

   VER - Version (2 bits)
      Version of the capabilities fields definition.  This document
      specifies codepoint 0 (00b).  With the exception of the immediate
      mirroring - simplifying the receiver side processing - and the
      masking of some least significant bits before performing the
      Protection Against Wrapped Sequence Numbers (PAWS) test, hosts
      must not interpret the received timestamps, nor use a timestamp
      value as input into advanced heuristics, if the version received
      is not supported.  This is the same requirement as for current
      [RFC1323] compliant implementations.
      The lower 3 octets of the timestamp capability flags MUST be
      ignored if an unsupported version is received.  It is expected
      that a host will implement at least version 0.  A receiver MUST
      respond with the appropriate (equal or version 0) version when
      responding to a new session request.

   MSK - Mask Timestamps (5 bits)
      The MaSK field indicates how many least significant bits should
      be excluded by the receiver before further processing the
      timestamp (e.g. PAWS, or for timing purposes).  The unmasked
      portion of a TSval has to comply with the constraints imposed by
      [RFC1323] on the generation of valid timestamps, e.g. it must be
      monotonically increasing between segments, and strictly
      monotonically increasing for each TCP window.
      Note that this does not impact the reflected timestamp in any way
      - TSecr will always be equal to an appropriate TSval.  This field
      MUST be present in all future versions of timestamp capability
      fields.  A value of 31 (all bits set) MUST be interpreted by a
      receiver as meaning that the full TSval is to be ignored by any
      legacy heuristics, e.g. by disabling PAWS.  For PAWS to be
      effective, at least two unmasked bits are required to
      discriminate between an increase (and roll-over) versus outdated
      segments.

5.2.  Timestamp clock interval encoding

   Kind: 8

   Length: 10 bytes

   +-------+-------+---------------------+---------------------+
   |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
   +-------+-------+---------------------+---------------------+
       1       1              4          |          4          |
                                         /                     |
    .-----------------------------------'                      |
   /                                                             \
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |E|   |         #               |         |                     |
   |X|VER|   MSK   #      RES      |   ADJ   |         INT         |
   |O|   |         #               |         |                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

            Figure 3: Timestamp Capability flags - version 0

   RES - Reserved (8 bits)
      Reserved for future use, and MUST be zero ("0") with version 0.
      If timestamp capabilities are received with version set to 0, but
      some of these bits set, the receiver MUST ignore the extended
      options field and react as if the TSecr was zero (compatibility
      mode).  Incompatible extensions can use this property, by setting
      an arbitrary bit of the RES field to one, and VER to zero, to
      obtain up to 28 bits for other uses.

   ADJ - Adjustment factor (5 bits)
      The scaling factor by which the signaled interval has to be left-
      shifted.  This is similar to the way the Window Scale option is
      defined in [RFC1323].  All values between zero and 31 are valid.
      This allows timestamp clock ticks of up to 15.99 s.
      See Table 1 for details.

   INT - Interval (11 bits)
      The integer part of the timestamp clock interval can be signaled
      with up to 11 bits of precision.  This allows a range with the
      highest resolution to cover clock intervals between 7.45 ns
      (INT=0x400, ADJ=0) and 15.99 s (INT=0x7FF, ADJ=31).
      Only non-zero values are valid when ADJ is non-zero.  An invalid
      combination of ADJ and INT MUST be treated as if no timestamp
      capability negotiation is attempted.
      A compliant sender can choose the value of the TSval in such a
      way that either the EXO bit, or some of the RES bits are set, or
      all the INT bits are cleared, in the encoded response from the
      receiver.  A receiver that does not reflect the initial TSval in
      its <SYN,ACK> and instead sends a zero value in TSecr will not
      erroneously negotiate for timestamp capabilities.

   The shortest meaningful interval was found to be a 64 byte packet
   (e.g. an ACK segment on Ethernet-like encoding) sent at a rate of
   100 Gbit/s.  This corresponds to a maximum timestamp clock rate of
   around 200 MHz, or an interval between clock ticks of around 5 ns.
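
   The version 0 field layout of Figure 3 can be sketched in a few
   lines of code.  The sketch assumes that the leftmost bit of the
   figure (EXO) is the most significant bit of the 32-bit TSecr value,
   as is usual for network byte order diagrams.  The names TsCaps,
   pack_caps, unpack_caps and is_valid_v0 are hypothetical helpers
   introduced for illustration, not part of this specification.

```python
from typing import NamedTuple


class TsCaps(NamedTuple):
    """Fields of the version 0 timestamp capabilities (Figure 3)."""
    exo: int   # 1 bit, MUST be one
    ver: int   # 2 bits, version
    msk: int   # 5 bits, number of masked least significant bits
    res: int   # 8 bits, MUST be zero in version 0
    adj: int   # 5 bits, adjustment (shift) factor
    intv: int  # 11 bits, interval


def pack_caps(msk, adj, intv, ver=0, res=0, exo=1):
    """Build the 32-bit capabilities word; EXO is assumed to be bit 31."""
    fields = ((exo, 1), (ver, 2), (msk, 5), (res, 8), (adj, 5), (intv, 11))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field value out of range")
    return ((exo << 31) | (ver << 29) | (msk << 24)
            | (res << 16) | (adj << 11) | intv)


def unpack_caps(word):
    """Split a 32-bit capabilities word into its fields."""
    return TsCaps(exo=(word >> 31) & 0x1, ver=(word >> 29) & 0x3,
                  msk=(word >> 24) & 0x1F, res=(word >> 16) & 0xFF,
                  adj=(word >> 11) & 0x1F, intv=word & 0x7FF)


def is_valid_v0(caps):
    """Apply the version 0 checks described for EXO, RES, ADJ and INT."""
    if caps.exo != 1 or caps.ver != 0 or caps.res != 0:
        return False
    # INT must be non-zero whenever ADJ is non-zero.
    return not (caps.adj != 0 and caps.intv == 0)
```

   For example, pack_caps(msk=0, adj=20, intv=0x106) yields 0x8000A106
   under the bit-order assumption stated above.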

   The wide range of indicated timestamp clock intervals (spanning 9
   orders of (decimal) magnitude, or 28 binary digits) and the
   limitation to no more than 24 bits require the use of a logarithmic
   encoding.  Since the precision of the timestamp clock value is most
   valuable at low frequencies (long tick durations), the clock rate is
   encoded as a time duration.  This results in full precision for
   commonly used timestamp clock intervals, while allowing even shorter
   intervals at reduced precision.  A format was chosen that is simple
   to implement and poses no risk of confusion with common floating
   point representations.

   Conceptually, the timestamp clock interval can be represented as an
   unsigned integer with 42 bits length.  In this form, the least
   significant bit represents an interval of 2^-38 sec (3.64 ps), while
   still allowing a maximum interval of 16 sec.  This value is then
   shifted to the right until it can be represented by only 11
   significant bits, and the number of shift operations is stored as
   the scaling adjustment factor (ADJ).

   A value of zero (both ADJ and INT are set to zero) is supported and
   indicates that the timestamp values are NOT correlated to wall-clock
   time (e.g. the sender may perform some form of segment counting or
   sequence number extension instead).  A host receiving an interval of
   zero from the other end host MUST NOT perform time-based heuristics
   which take the received TSval into account, but SHOULD apply the
   regular PAWS test.  The sender MUST guarantee the monotonic increase
   of timestamp values just as with [RFC1323].

   Timestamp clock periods faster than 1 ms SHOULD be implemented by
   inserting the timestamp "late" before transmitting a segment, to
   avoid unnecessary timing jitter.  The shortest clock periods, with
   intervals of only a few microseconds or less, are provided for
   hardware-assisted implementations, e.g. direct use of a (shifted)
   CPU cycle counter as clock source.

   The range of possible values runs from 15.99 s to 7.45 ns with
   highest precision, and down to 3.64 ps with reducing precision, which
   is also the shortest difference in tick duration that could be
   resolved.  This equates to clock frequencies of 0.06 Hz, 134 MHz and
   275 GHz respectively.

   Despite the provision of such a large dynamic range, a receiver
   should consider that a timestamp clock may deviate from the indicated
   rate by a large fraction.  Similarly, a sender SHOULD refrain from
   signaling the clock interval with too much precision (significant
   bits) if the clock cannot be sampled with low variance over time.
   This can be achieved by shifting the value more to the right, so that
   INT will contain a number of leading zero bits.

   If a sender is using a less precise clock source, fewer significant
   bits can be used to implicitly signal this.  For example, a timestamp
   clock interval of approximately 1 ms (1/1024th sec) can be
   represented by both (INT=0x001, ADJ=28) and (INT=0x400, ADJ=18).  A
   more accurate representation of 1 ms would be (INT=0x418, ADJ=18).
   The latter representation carries more significant bits, indicating a
   more stable clock source with low jitter.

   As rough guidance, common software interrupt clock sources should be
   indicated with 2 leading zero bits in the INT field, as in the
   examples below.  Low variance software clocks (e.g. CPU cycle
   counters) may be indicated with a single leading zero bit, and
   hardware injecting the timestamp into the header with high precision
   may use the full precision.  Similarly, if the clock source exhibits
   a very high variability (e.g. when running in a virtualized
   environment), 3 or more leading zeros should be used in the INT
   field.
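
   The conceptual encoding (a fixed-point value whose least significant
   bit is 2^-38 s, shifted right until it fits the INT field) can be
   sketched as follows.  The function name and the leading_zero_bits
   parameter are hypothetical, and truncation by right shift is an
   assumption taken from the description above; an implementation may
   round differently.

```python
FRACTION_BITS = 38  # least significant bit of the conceptual value: 2^-38 s
INT_BITS = 11       # width of the INT field
MAX_ADJ = 31        # largest value of the 5-bit ADJ field


def encode_interval(seconds, leading_zero_bits=0):
    """Return (ADJ, INT) for a timestamp clock interval in seconds.

    leading_zero_bits reduces the signaled precision, as suggested for
    less stable clock sources.  An interval of 0 yields (0, 0), which
    signals timestamps that are not correlated to wall-clock time.
    """
    if seconds < 0:
        raise ValueError("interval must not be negative")
    if not 0 <= leading_zero_bits < INT_BITS:
        raise ValueError("leading_zero_bits out of range")
    # Conceptual fixed-point representation, truncated to an integer.
    value = int(seconds * (1 << FRACTION_BITS))
    if seconds > 0 and value == 0:
        raise ValueError("interval too short to be represented")
    limit = 1 << (INT_BITS - leading_zero_bits)
    adj = 0
    while value >= limit:
        value >>= 1  # drop one bit of precision per shift
        adj += 1
    if adj > MAX_ADJ:
        raise ValueError("interval too long for the requested precision")
    return adj, value
```

   With these assumptions, encode_interval(0.001, leading_zero_bits=2)
   returns (20, 0x106), and encode_interval(0.001) returns (18, 0x418),
   the more precise representation of 1 ms mentioned above.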

   Example of a timestamp capability negotiation, indicating that the
   sender's timestamp clock (TCP clock) is running with 1 ms per tick,
   and using a clock source of typical quality (e.g. software timer
   interrupt):

      SYN, TSval=, TSecr=EXO|VER=0|MSK=0|RES=0|ADJ=20|INT=0x106

   For a similar clock source running at 10 ms, the fields would look
   like this:

      SYN, TSval=, TSecr=EXO|VER=0|MSK=0|RES=0|ADJ=23|INT=0x147

   +-----------+------------+--------------------+---------------------+
   |   tick    |    tick    |    encoding at     | encoding at lowest  |
   | interval  | frequency  | highest precision  |      precision      |
   +-----------+------------+--------------------+---------------------+
   |     16 s  |   0.06 Hz  | ADJ=31, INT=0x7FF  |  ADJ=31, INT=0x7FF  |
   |      1 s  |      1 Hz  | ADJ=28, INT=0x400  |  ADJ=31, INT=0x080  |
   |    0.5 s  |      2 Hz  | ADJ=27, INT=0x400  |  ADJ=31, INT=0x040  |
   |   100 ms  |     10 Hz  | ADJ=24, INT=0x666  |  ADJ=31, INT=0x00C  |
   |    10 ms  |    100 Hz  | ADJ=21, INT=0x51F  |  ADJ=31, INT=0x001  |
   |     4 ms  |    250 Hz  | ADJ=20, INT=0x419  |  ADJ=30, INT=0x001  |
   |     1 ms  |     1 kHz  | ADJ=18, INT=0x418  |  ADJ=28, INT=0x001  |
   |   200 us  |     5 kHz  | ADJ=15, INT=0x68E  |  ADJ=25, INT=0x001  |
   |    50 us  |    20 kHz  | ADJ=13, INT=0x68E  |  ADJ=23, INT=0x001  |
   |     1 us  |     1 MHz  | ADJ=8,  INT=0x432  |  ADJ=18, INT=0x001  |
   |    60 ns  |  16.7 MHz  | ADJ=4,  INT=0x407  |  ADJ=14, INT=0x001  |
   +-----------+------------+--------------------+---------------------+

          Table 1: Commonly used TCP Timestamp Clock intervals

   The timestamp clock values a host is using need not run synchronously
   with the internal TCP clock.  Different clock sources, such as an NTP
   stratum, RTC, CPU cycle counters, or other independent clocks can be
   used to derive the TSval.  This allows the decoupling of the
   coarse-grained TCP clock used for retransmission and delayed ACK
   timeouts from the clock intervals indicated in the TSval itself.
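
   The entries of Table 1 can be checked with a small decoder.  The
   sketch assumes that the signaled interval equals INT * 2^(ADJ - 38)
   seconds, which follows from the conceptual 42-bit representation
   and matches the table rows; decode_interval is a hypothetical helper
   name.

```python
def decode_interval(adj, intv):
    """Return the tick interval in seconds for a received (ADJ, INT) pair.

    Returns None when both fields are zero (timestamps not correlated
    to wall-clock time) and raises ValueError for the invalid
    combination of a non-zero ADJ with a zero INT.
    """
    if not (0 <= adj <= 31 and 0 <= intv <= 0x7FF):
        raise ValueError("field value out of range")
    if intv == 0:
        if adj != 0:
            raise ValueError("INT must be non-zero when ADJ is non-zero")
        return None
    # INT is left-shifted by ADJ; one unit of the result is 2^-38 s.
    return intv * 2.0 ** (adj - 38)


# A few rows of Table 1: nominal interval, ADJ, INT (highest precision).
rows = [(1.0, 28, 0x400), (0.1, 24, 0x666), (0.01, 21, 0x51F),
        (0.001, 18, 0x418), (200e-6, 15, 0x68E), (60e-9, 4, 0x407)]
for nominal, adj, intv in rows:
    interval = decode_interval(adj, intv)
    frequency = 1.0 / interval
    print(f"ADJ={adj:2d} INT=0x{intv:03X} -> {interval:.6g} s, "
          f"{frequency:.6g} Hz (nominal {nominal:g} s)")
```

   After this one-time conversion, a receiver can work with the
   interval as an ordinary number, for example when scaling TSval
   differences into seconds.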

   [RFC1323] only defines a single use of the timestamp for timing
   purposes.  Using the TCP timer directly as timestamp clock source is
   useful for RTT measurement and calculation of the RTO, as it
   minimizes subsequent RTT conversions.

   Most stacks will not support dynamically adjusted timestamp clock
   intervals (e.g. for local, sub-sea, or satellite links).  Therefore,
   the indicated clock duration can be a static, compile time value.  To
   use the indicated clock interval, for example to perform one-way
   delay variation calculations, simple integer operations can be used
   after an initial conversion of the wire representation to longer
   (i.e. 32 or 64 bit) integer values (see Appendix B.1 for some
   details).

5.3.  Negotiation error detection and recovery

   During the initial TCP three-way handshake, timestamp capabilities
   are negotiated using the TSecr field.  Timestamp capabilities MAY
   only be negotiated in TSecr when the SYN bit is set.  A host detects
   the presence of timestamp capability flags when the EXO bit is set in
   the TSecr field of the received SYN segment.  When receiving a
   session request (SYN segment with timestamp capabilities), a
   compliant TCP receiver is required to XOR the received TSval with the
   receiver's timestamp capabilities.  The resulting value is then sent
   in the response.

   To support the design goals stated in Section 4, only the TSecr
   field in the initial SYN can be used directly.  The response from
   the receiver has to be encoded, since no unused field is available in
   the SYN-ACK.  The most straightforward encoding is an XOR with a
   value that is known to the sender.  Therefore, the receiver also uses
   TSecr to indicate its capabilities, but calculates the XOR sum with
   the received TSval.  This allows the receiver to remain stateless,
   and functionality like SYN Cache (see [RFC4987]) can be maintained
   with no change.
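
   The XOR exchange described above can be sketched as follows.  This
   is an illustration under stated assumptions, not a definitive
   implementation: the field widths follow the text (EXO 1 bit, VER 2,
   MSK 5, RES 8, ADJ 5, INT 11), but the exact bit positions (EXO as
   the most significant bit, INT as the least significant bits) and all
   function names are assumptions made for this example.

```python
# Sketch only: bit positions are an assumption (EXO as the most
# significant bit, followed by VER, MSK, RES, ADJ and INT).
EXO = 1 << 31


def pack_caps(ver, msk, adj, int_field):
    """Build a 32-bit capability word with EXO set and RES cleared."""
    return EXO | (ver << 29) | (msk << 24) | (adj << 11) | int_field


def unpack_caps(word):
    """Split a 32-bit word into the named capability fields."""
    return {
        "EXO": (word >> 31) & 0x1,
        "VER": (word >> 29) & 0x3,
        "MSK": (word >> 24) & 0x1F,
        "RES": (word >> 16) & 0xFF,
        "ADJ": (word >> 11) & 0x1F,
        "INT": word & 0x7FF,
    }


def receiver_tsecr(syn_tsval, local_caps):
    """Receiver side: TSecr of the reply is TSval XOR capabilities.
    No per-connection state is needed to compute it."""
    return (syn_tsval ^ local_caps) & 0xFFFFFFFF


def sender_decode(reply_tsecr, sent_tsval):
    """Sender side: recover the peer's capability fields, or None if
    the result is not a valid version 0 capability word."""
    fields = unpack_caps(reply_tsecr ^ sent_tsval)
    if fields["EXO"] == 1 and fields["VER"] == 0 and fields["RES"] == 0:
        return fields
    return None
```

   A receiver that simply echoes the received TSval, as a legacy
   [RFC1323] stack does, produces an all-zero word after the XOR on the
   sender side; EXO is then clear and sender_decode() returns None, so
   no capabilities are assumed.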

   If a sender has to retransmit the SYN, this encoding also allows it
   to detect which segment was received and responded to.  This is
   possible by changing the timestamp clock offset between
   retransmissions in such a way that decoding on the sender side would
   result in an invalid timestamp capability negotiation field (e.g.
   some RES bits are set).  If the sender does not require the
   capability to differentiate which SYN was received, the timestamp
   clock offset for each new SYN can be set in such a way that the TSopt
   of the SYN is identical for each retransmission.

   As a receiver MAY report back a zero value at any time, in particular
   during the SYN-ACK, the sender is slightly constrained in its
   selection of an initial Timestamp value.  The Timestamp value sent in
   the SYN should be selected in such a way that it does not resemble
   a valid Timestamp capabilities field.  One approach to ensure this
   property is for the sender to make sure that at least one bit of the
   RES field is set.  This prevents a compliant sender from erroneously
   detecting a compliant receiver if the returned TSecr value is zero.

   A host initiating a TCP session must verify that the partner also
   supports timestamp capability negotiation and a supported version
   before using enhanced algorithms.  Note that this change in semantics
   does not necessarily change the signaling of timestamps on the wire
   after initial negotiation.

   To mitigate the effect of misbehaving TCP senders appearing to
   negotiate for timestamp capabilities, a receiver MUST verify that one
   specific bit (EXO) is set, and any reserved bits (currently 8, RES
   field) are cleared.  This limits the chance for a receiver to
   mistakenly negotiate for version 0 capabilities in the presence of a
   misbehaving sender to around 0.05%.  The prevalence of misbehaving
   senders, and distribution of observed TSecr values, limits this to
   less than 1 in 6 million.
   The modifications described in [I-D.ietf-tcpm-1323bis] and
   implemented in a receiver would further decrease the chance of a
   false negotiation to less than 10^-7.

   However, as a receiver has to use changed semantics when reflecting
   TSval also for higher values in the version field, a misbehaving
   sender negotiating for SACK, but not properly clearing TSecr, may
   have a 37.5% chance of receiving timestamp values with modified
   receiver behavior (from an approximate population of 0.00036% of
   sessions being started without a cleared TSecr).  This may lead to an
   increased number of spurious retransmission timeouts, putting such a
   session from a misbehaving TCP sender at a disadvantage.

   Once timestamp capabilities are successfully negotiated, the receiver
   MUST ignore the indicated number of masked, low-order bits before
   applying the heuristics defined in [RFC1323].  The monotonic increase
   of the timestamp value for each new segment could be violated if the
   full 32 bit field, including the masked bits, is used.  This
   conflicts with the constraints imposed by PAWS.

   The presented distribution of the common three fields, EXO, VER and
   MASK, that MUST be present regardless of which version is implemented
   in a compliant TCP stack, is a result of the previously mentioned
   design goals.  The lower three octets MAY be redefined freely with
   subsequent versions of the timestamp capability negotiation protocol.
   This allows a future version to be implemented in such a way that a
   receiver can still operate with the modified behavior, and a minimum
   amount of processing (PAWS).

5.4.  Interaction with Selective Acknowledgment

   If both Timestamp capabilities and Selective Acknowledgement options
   [RFC2018] are negotiated (both hosts send these options in their
   respective handshake segments), both hosts MUST echo the timestamp
   value of the last received segment, irrespective of the order of
   delivery.  Note that this is in conflict with [RFC1323], where only
   the timestamp of the last segment received in sequence is mirrored.
   As SACK allows discrimination of reordered or lost segments, the
   reflected timestamp is not required to convey the most conservative
   information.  If SACK indicates lost or reordered packets at the
   receiver, the sender MUST take appropriate action such as ignoring
   the received timestamps for calculating the round trip time, or
   assuming a delayed packet (with appropriate handling).  An updated
   algorithm to calculate the retransmission timeout timer (RTO) is
   beyond the scope of this document.

   The immediate echoing of the last received timestamp value allowed by
   the simultaneous use of the timestamp option with the SACK option
   enables enhancements to improve loss recovery, round trip time (RTT)
   and one-way delay (OWD) variation measurements (see Appendix B) even
   during loss or reordering episodes.  This is enabled by removing any
   retransmission ambiguity using unique timestamps for every
   retransmission, while simultaneously the SACK option indicates the
   ordering of received segments even in the presence of ACK loss or
   reordering.

   For legacy applications of the timestamp option such as RTTM and
   PAWS, the presence of the SACK option gives a clear indication of
   loss or reordering.  Under these circumstances, RTTM should not be
   invoked even under [RFC1323] (but often is, due to separate handling
   of timestamp and SACK options).

   The use of RTT and OWD measurements during loss episodes is an open
   research topic.
   A sender has to accommodate the changed timestamp semantics in order
   to maintain a conservative estimate of the Retransmission Timer as
   defined in [RFC6298], when a TCP sender has negotiated for an
   immediate reflection of the timestamp of the segment triggering an
   ACK (i.e. both timestamp capability negotiation and Selective
   Acknowledgements are enabled for the session).  As the presence of a
   SACK option in an ACK signals an ongoing reordering or loss episode,
   timestamps conveyed in such segments MUST NOT be used to update the
   retransmission timeout.  Also note that the presence of a SACK option
   removes the need for the receiver to reflect the last in-order
   timestamp, as lost ACKs can no longer cause erroneous updates of the
   retransmission timeout.

5.4.1.  Interaction with the Retransmission Timer

   The above stated rule, to ignore timestamps as soon as a SACK option
   is present, is fully consistent with the guidance given in [RFC1323],
   even though most implementations skip such an additional
   verification step in the presence of the SACK option.

   To address the additional delay imposed by delayed ACKs, a compliant
   sender SHOULD modify the update procedure when receiving normal,
   in-sequence ACKs that acknowledge more than SMSS bytes, so that the
   input (denoted R in [RFC6298]) is calculated as

      R = RTT * ( 1 + 1/(cwnd/smss) )

   If RTT (as measured in units of the timestamp clock) is smaller than
   the congestion window measured in full sized segments, the above
   heuristic MAY be bypassed before updating the retransmission timeout
   value.

5.4.2.  Interaction with the PAWS test

   The PAWS test as defined in [RFC1323] requires monotonically
   increasing values at the receiver side.  As TS.Recent is no longer
   used to track which timestamp to echo, this variable can be reused.
   Instead of tracking the timestamp sent in the most recent ACK, a
   stricter update rule could be used:

      "For example, we might save the timestamp from the segment that
      last advanced the left edge of the receive window, i.e., the most
      recent in-sequence segment."

   TS.Recent is only to be updated whenever the left window edge
   advances, but no longer has to consider delayed ACKs.

5.5.  Discussion

   RTT and OWD variation during loss episodes is not deeply researched.
   Current heuristics ([RFC1122], [RFC1323], Karn's algorithm [RFC2988])
   explicitly exclude (and prevent) the use of RTT samples when loss
   occurs.  However, solving the retransmission ambiguity problem - and
   the related reliable ACK delivery problem - would enable new
   functionality to improve TCP processing.  Also, having an immediate
   echo of the last received timestamp value would enable new research
   to distinguish between corruption loss (assumed to have no RTT / OWD
   impact) and congestion loss (assumed to have RTT / OWD impact).
   Research into this field appears to be rather neglected, especially
   when it comes to large scale, public internet investigations.  Due to
   the very nature of this, passive investigations without signals
   contained within the headers are only of limited use in empirical
   research.

   Retransmission ambiguity detection during loss recovery would allow
   an additional level of loss recovery control without reverting to
   timer-based methods.  As with the deployment of SACK, separating
   "what" to send from "when" to send it could be driven one step
   further.  In particular, less conservative loss recovery schemes
   which do not trade principles of packet conservation against
   timeliness require a reliable way of prompt and best possible
   feedback from the receiver about any delivered segments and their
   ordering.
   [RFC2018] SACK alone goes quite a long way, but using timestamp
   information in addition could remove any ambiguity.  However, the
   current specification in [RFC1323] makes that use impossible, thus a
   modified semantic (receiver behavior) is a necessity.

   A change in signaling with immediate timestamp value echoes would
   however break some legacy, per-packet RTT measurements.  The reason
   is that delayed ACKs would not be covered.  Research has shown that
   per-packet updates of the RTT estimation (for the purpose of
   calculating a reasonable RTO value) are only of limited benefit (see
   [Path99] and [PH04]).  This is the most serious implication of the
   proposed signaling scheme with directly echoing the timestamp value
   of the segment triggering the ACK, when the SACK option is also
   negotiated.  Even when using the directly reflected timestamp values
   in an unmodified RTT estimator, the immediate impact would be
   limited to causing premature RTOs when the sending rate suddenly
   drops below two segments per RTT.  That is, assuming the receiver
   implements delayed ACK and sends one ACK for every other data
   segment received.  If the receiver also has D-SACK [RFC2883] enabled,
   such premature RTOs can be detected and mitigated by the sender (for
   example, by increasing minRTO for low bandwidth flows).

   Allowing timestamps to play a more important role in TCP signaling
   also gives rise to concerns.  When the timestamp is used for
   congestion control purposes, this gives an incentive for malicious
   receivers to reflect tampered timestamps.  During the early phases of
   the introduction of Cubic, such modifications were shown to result
   in unfair advantages for malicious receivers that selectively alter
   the reflected timestamp values (see [CUBIC]).
   For that very reason, this document introduces the explicit
   possibility to include a signal in the timestamp values that is
   excluded from any processing by the receiver.  A sender can then
   decide how to make use of this capability, e.g. for use as additional
   security information, improvements of loss recovery or other, yet
   unknown, means.

6.  Acknowledgements

   The authors would like to thank Dragana Damjanovic for some initial
   thoughts around Timestamps and their extended potential use.

   We would like to thank Bob Briscoe for his insightful comments, and
   the gratuitous donation of text, that have resulted in a
   substantially improved document.

   We further want to thank Michael Welzl and Brian Trammell for their
   input and discussion.  Special thanks to Brian, who showed how to do
   more with less.

7.  Updates to Existing RFCs

   Care has been taken to make sure the updates in this specification
   can be deployed incrementally.

   Updates to existing [RFC1323] implementations are only REQUIRED if
   they do not clear the TSecr value in the initial SYN segment.  This
   is a misinterpretation of [RFC1323] and may leak data anyway (see
   [I-D.ietf-tcpm-tcp-security]).  Also see [I-D.ietf-tcpm-1323bis], as
   this stipulates, that the TSval sent in a should be zeroed,
   further reducing the chance for a false positive.  It is expected
   that these changes are implemented in stacks making use of timestamp
   negotiation.  Otherwise, there will be no need to update an
   RFC1323-compliant TCP stack unless the timestamp capabilities
   negotiation is to be used.

   Implementations compliant with the definitions in this document shall
   be prepared to encounter misbehaving senders that don't clear TSecr
   in their initial SYN.  It is believed that checking the reserved
   bits to be all zero provides enough protection against misbehaving
   senders.

   In the unlikely case of an erroneous negotiation of timestamp
   capabilities between a compliant receiver and a misbehaving sender,
   the proposed semantic changes will trigger a higher rate of spurious
   retransmissions, while time-based heuristics on the receiver side may
   further negatively impact congestion control decisions.  Overall,
   misbehaving receivers will suffer from self-inflicted reductions in
   TCP performance.

8.  IANA Considerations

   With this document, the IANA is requested to establish a new registry
   to record the timestamp capability flags defined with future versions
   (codepoints 1, 2 and 3).

   The lower 24 bits (3 octets) of the timestamp capabilities field may
   be freely assigned in future versions.  The first octet must always
   contain the EXO, VER and MASK fields for compatibility, and the MASK
   field MUST be set to allow interoperation with a version 0 receiver.

   This document specifies version 0 and the use of the last three
   octets to signal the sender's timestamp clock interval to the
   receiver.

9.  Security Considerations

   The algorithm presented in this paper shares security considerations
   with [RFC1323] (see [I-D.ietf-tcpm-tcp-security]).

   An implementation can address the vulnerabilities of [RFC1323] by
   dedicating a few low-order bits of the timestamp fields for use with
   a (secure) hash that protects against malicious modification of the
   returned timestamp value by the receiver.  A MASK field has been
   provided to explicitly notify the receiver about that alternate use
   of low-order bits.  This allows the use of timestamps for purposes
   requiring higher integrity and security while still allowing the
   receiver to extract useful information.

10.  References

10.1.  Normative References

   [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
              for High Performance", RFC 1323, May 1992.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

10.2.  Informative References

   [BSD10]    Hayes, D., "Timing enhancements to the FreeBSD kernel to
              support delay and rate based TCP mechanisms", Feb 2010.

   [CUBIC]    Rhee, I., Ha, S., and L. Xu, "CUBIC: A New TCP-Friendly
              High-Speed TCP Variant", Feb 2005.

   [Chirp]    Kuehlewind, M. and B. Briscoe, "Chirping for Congestion
              Control - Implementation Feasibility", Nov 2010.

   [Cho08]    Cho, I., Han, J., and J. Lee, "Enhanced Response Algorithm
              for Spurious TCP Timeout (ER-SRTO)", Jan 2008.

   [I-D.blanton-tcp-reordering]
              Blanton, E., Dimond, R., and M. Allman, "Practices for TCP
              Senders in the Face of Segment Reordering",
              draft-blanton-tcp-reordering-00 (work in progress),
              February 2003.

   [I-D.ietf-ledbat-congestion]
              Hazel, G., Iyengar, J., Kuehlewind, M., and S. Shalunov,
              "Low Extra Delay Background Transport (LEDBAT)",
              draft-ietf-ledbat-congestion-09 (work in progress),
              October 2011.

   [I-D.ietf-tcpm-1323bis]
              Borman, D., Braden, R., Jacobson, V., and R.
              Scheffenegger, "TCP Extensions for High Performance",
              draft-ietf-tcpm-1323bis-03 (work in progress), July 2012.

   [I-D.ietf-tcpm-anumita-tcp-stronger-checksum]
              Biswas, A., "Support for Stronger Error Detection Codes in
              TCP for Jumbo Frames",
              draft-ietf-tcpm-anumita-tcp-stronger-checksum-00 (work in
              progress), May 2010.

   [I-D.ietf-tcpm-tcp-security]
              Gont, F., "Survey of Security Hardening Methods for
              Transmission Control Protocol (TCP) Implementations",
              draft-ietf-tcpm-tcp-security-03 (work in progress),
              March 2012.

   [I-D.sabatini-tcp-sack]
              Sabatini, A., "Highly Efficient Selective Acknowledgement
              (SACK) for TCP", draft-sabatini-tcp-sack-00 (work in
              progress), February 2012.

   [Linux]    Sarolahti, P., "Linux TCP", Apr 2007.

   [PH04]     Eckstroem, H. and R. Ludwig, "The Peak-Hopper: A New
              End-to-End Retransmission Timer for Reliable Unicast
              Transport", Apr 2004.

   [Path99]   Allman, M. and V. Paxson, "On Estimating End-to-End
              Network Path Properties", Sep 1999.

   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
              Extension to the Selective Acknowledgement (SACK) Option
              for TCP", RFC 2883, July 2000.

   [RFC2988]  Paxson, V. and M. Allman, "Computing TCP's Retransmission
              Timer", RFC 2988, November 2000.

   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
              for TCP", RFC 3522, April 2003.

   [RFC4015]  Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
              for TCP", RFC 4015, February 2005.

   [RFC4987]  Eddy, W., "TCP SYN Flooding Attacks and Common
              Mitigations", RFC 4987, August 2007.

   [RFC6013]  Simpson, W., "TCP Cookie Transactions (TCPCT)", RFC 6013,
              January 2011.

   [RFC6247]  Eggert, L., "Moving the Undeployed TCP Extensions RFC
              1072, RFC 1106, RFC 1110, RFC 1145, RFC 1146, RFC 1379,
              RFC 1644, and RFC 1693 to Historic Status", RFC 6247,
              May 2011.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298,
              June 2011.

Appendix A.  Pseudo Code examples

A.1.  Sender

   Active Open:

      (start)
      TS.recent = TS.clock + TS.offset
      if [TS.recent].EXO == 1 &&
         [TS.recent].VER == 0 &&
         [TS.recent].RES == 0
        TS.offset += PRNG.random
        goto start

   Retransmission:

      TS.offset = TS.clock - TS.recent + 0x00010000
      TS.recent = TS.clock + TS.offset

   Receive SYN-ACK:

      TS.field = TSecr XOR TS.recent
      if [TS.field].EXO == 1
        if [TS.field].VER == 0
          if [TS.field].RES == 0
            if [TS.field].ADJ == 0 AND [TS.field].INT == 0
              disable enhanced time-based heuristics,
              use PAWS with MSK adjustment
              enable immediate TSval echoing
            if [TS.field].ADJ != 0 AND [TS.field].INT == 0
              negotiation failed,
              use legacy RFC1323 timestamp handling
            if [TS.field].INT != 0
              TS.interval = [TS.field].INT << [TS.field].ADJ
              use PAWS with MSK adjustment
              enable immediate TSval echoing
              (possibly do precision determination)
          if [TS.field].RES != 0 AND SYN was retransmitted
            determine if old TSval would fit,
            update TS.offset and repeat process
        if [TS.field].VER != 0
          disable enhanced time-based heuristics,
          use PAWS with MSK adjustment
          enable immediate TSval echoing

A.2.  Receiver

   Receive SYN:

      TS.field = SYN.TSecr
      TS.return = 0
      if [TS.field].EXO == 1
        if [TS.field].VER == 0
          if [TS.field].RES == 0
            if [TS.field].ADJ == 0 AND [TS.field].INT == 0
              disable enhanced time-based heuristics,
              use PAWS with MSK adjustment
              enable immediate TSval echoing
              TS.return = local capabilities
            if [TS.field].ADJ != 0 AND [TS.field].INT == 0
              negotiation failed,
              use legacy RFC1323 timestamp handling
              TS.return = 0
            if [TS.field].INT != 0
              TS.interval = [TS.field].INT << [TS.field].ADJ
              use PAWS with MSK adjustment
              enable immediate TSval echoing
              (possibly do precision determination)
              TS.return = local capabilities
        if [TS.field].VER != 0
          disable enhanced time-based heuristics,
          use PAWS with MSK adjustment
          enable immediate TSval echoing
          TS.return = local capabilities
      TSecr = SYN.TSval XOR TS.return

Appendix B.  Possible use cases

B.1.  One-way delay variation measurement

   New congestion control algorithms are currently being proposed that
   react to the measured one-way delay variation (e.g.
   [I-D.ietf-ledbat-congestion], [Chirp]).  This control variable is
   updated after each received ACK:

      C(t) = TSval(t) - TSecr(t)

      V(t) = C(t) - C(t-1)

   provided that the timestamp clocks at both ends are running at
   roughly the same rate.  Without prior knowledge of the timestamp
   clock interval used by the partner, a sender can try to learn this
   interval by observing the exchanged segments for a duration of a few
   RTTs.  However, such a scheme fails if the partner uses some form of
   implicit integrity check of the timestamp values, which would appear
   as either random scrambling of LSB bits in the timestamp, or give the
   impression of much shorter clock intervals than what is actually
   used.
   If the partner uses some form of segment counting as timestamp
   value, without any direct relationship to the wall-clock time, the
   above formula will fail to yield meaningful results.  Finally, the
   network conditions need to remain stable during any such training
   phase, so that the sender can arrive at reasonable estimates of the
   partner's timestamp clock tick duration.

   This note addresses these concerns by providing a means by which both
   hosts are required to use a timestamp clock that is closely related
   to the wall-clock time, with known clock rate, and also provides
   means by which a host can signal the use of a few LSB bits for
   timestamp value integrity checks.  To arrive at a valid one-way delay
   (OWD) variation, first the timestamp received from the partner has to
   be right-shifted by a known amount of bits as defined by the mask
   field.  Next, the local and remote timestamp values need to be
   normalized to a common base clock interval (typically, the local
   clock interval):

      C(t) = (TSecr >> local mask)
             - (TSval >> remote mask) * remote interval / local interval

      V(t) = C(t) - C(t-1)

   The adjustment factor can be calculated once during the timestamp
   capability negotiation phase, and pure integer arithmetic can be used
   during per-segment processing:

      EXP.min = min(EXP.loc, EXP.rem)

      EXP.rem -= EXP.min

      EXP.loc -= EXP.min

      FRAC.rem = (0x800 | FRAC.rem) << EXP.rem

      FRAC.loc = (0x800 | FRAC.loc) << EXP.loc

   and assuming that the local clock tick duration is lower

      ADJ = FRAC.rem / FRAC.loc

   with ADJ being an integer variable.  For higher precision, two
   appropriately calculated integers can be used.

   Any previously required training on the remote clock interval can be
   removed, resulting in a simpler and more dependable algorithm.
   Furthermore, transient network effects during the training phase
   which may result in a wrong inference of the remote clock interval
   are eliminated completely.

B.2.  Timestamp clock rate exposure

   Today, each TCP host may use an arbitrary, locally defined clock
   source to derive the timestamp value from.  Even though only a
   handful of typically used clock rates are implemented in common TCP
   stacks, this does not guarantee that any future stack will choose the
   same clock rate.  This poses a problem for current state of the art
   heuristics, which try to determine the sender's timestamp clock rate
   by pure passive observation of the TCP stream, and affects both
   advanced heuristics in the partner host of a TCP session and
   arbitrarily located passive observation points that estimate TCP
   session parameters.

   The proposed mechanism would reveal this information explicitly, even
   though other environmental factors, such as the operation of a TCP
   stack in a virtualized environment, may result in some deviations in
   the actually used clock rate.

   High-speed and real-time stacks would be expected to operate with
   higher clock rates, while the observed variance in (known) timestamp
   clock vs. reference clock could help in distinguishing between
   physical and virtual end hosts, for example.

B.3.  Early spurious retransmit detection

   Using the provided timestamp negotiation scheme, clients utilizing
   slow running timestamp clocks can set aside a small number of least
   significant bits in the timestamps.  These bits can be used to
   differentiate between original and retransmitted segments, even
   within the same timestamp clock tick (i.e. when RTT is shorter than
   the TCP timestamp clock interval).  It is recommended to use only a
   single bit (mask = 1), unless the sender can also perform lost
   retransmission detection.
   Using more than 2 bits for this purpose is discouraged due to the
   diminishing probability of losing retransmitted packets more than
   once.  A simple scheme could send out normal data segments with the
   masked bits all cleared.  Each advance of the timestamp clock also
   clears those bits again.  When a segment is retransmitted without the
   timestamp clock increasing, these bits are increased by one for each
   consecutive retry of the same segment, until the maximum value is
   reached.  Newly sent segments (during the same clock interval) should
   maintain these bits, in order to maintain monotonically increasing
   values, even though compliant end hosts do not require this property.
   This scheme maintains monotonically increasing timestamp values -
   including the masked bits.  Even without negotiating the immediate
   mirroring of timestamps (done by simultaneously negotiating timestamp
   capabilities and selective acknowledgments), this extends the use of
   the Eifel Detection [RFC3522] and Eifel Response [RFC4015] algorithms
   to detect and react to spurious retransmissions under all
   circumstances.  Also, currently experimental schemes such as ER-SRTO
   [Cho08] could be deployed without requiring the receiver to
   explicitly support that capability.

      Seg0  Seg1  Seg2  Seg3  Seg4
      TS00  TS00  TS00  TS00  TS00
             X

            Seg1  Seg5
            TS01  TS01

                  Seg6  Seg7
                  TS01  TS10

          Figure 4: timestamp for spurious retransmit detection

   Masked bits are the 2nd digit, the timestamp value is represented by
   the first digit.  The timestamp clock "ticks" between segment 6 and
   7.

B.4.  Early lost retransmission detection

   During phases where multiple segments in short succession (but not
   necessarily successive segments) are lost, there is a high likelihood
   that at least one segment is retransmitted while the cause of loss
   (e.g. congestion, fading) is still persisting.
   The best current algorithms can recover such a lost retransmission
   with a few constraints, for example, that the session has to have at
   least DupThresh more segments to send beyond the current recovery
   phase.  During loss recovery, when a retransmission is lost again,
   the timestamp currently also cannot be used as a means of conveying
   additional information to allow more rapid loss recovery while
   maintaining packet conservation principles.  Only the timestamp of
   the last segment preceding the continuous loss will be reflected.
   Using the extended timestamp option negotiation together with
   selective acknowledgements, the receiver will immediately reflect the
   timestamp of the last seen segment.  Using both SACK and TS
   information in conjunction with each other, a sender can infer the
   exact order in which original and retransmitted segments are
   received.  This allows faster recovery from lost retransmissions
   while maintaining the principle of packet conservation and avoiding
   costly retransmission timeouts.

   The implementation can be done in combination with the masked bit
   approach described in the previous paragraph, or without.  However,
   if the timestamp clock interval is lower than 1/2 RTT, both the
   original and the retransmitted segment may carry an identical
   timestamp.  If the sender cannot discriminate between the original
   and the retransmitted segments, it must refrain from taking any
   action before such a determination can be made.

   In this example, masked bits are used, with a simple marking method.
   As the timestamp value of the retransmission itself is already
   different from the original segments, such an additional
   discrimination would not strictly be required here.  The timestamp
   clock ticks in the first digit and the dupthresh value is 3.

      Seg0  Seg1  Seg2  Seg3  Seg4  Seg5  Seg6  Seg7
      TS00  TS00  TS00  TS10  TS10  TS10  TS10  TS20
             X     X     X     *

            Seg1  Seg2  Seg3  Seg4
            TS21  TS30  TS30  TS30
             X

            Seg1  Seg8  Seg9
            TS31  TS31  TS40

                     Figure 5: timestamp under loss

   If Seg1,TS00 is lost twice, and Seg4,TS10 is also lost, the sender
   could resend Seg1 once more after observing dupthresh number of
   segments sent after the first retransmission of Seg1 being received
   (i.e., when Seg4 is SACKed).  However, there is an ambiguity between
   retransmitted segments and original segments, as the sender cannot
   know if a SACK for one particular segment was due to the
   retransmitted segment or a delayed original segment.  The timestamp
   value will not help in this case, as per RFC1323 it will be held at
   TS00 for the entire loss recovery episode.  Therefore, currently a
   sender has to assume that any SACKed segments may be due to delayed
   originally sent segments, and can only resolve this conflict by
   injecting additional, previously unsent segments.  Once dupthresh
   newly injected segments are SACKed, continuous loss (and not further
   delay) of Seg1 can safely be assumed, and that segment be resent.
   This approach is conservative but constrained by the requirement that
   additional segments can be sent, and thereby delayed in the response.

   With the simultaneous use of timestamp extended options together with
   selective acknowledgments, the receiver would immediately reflect
   back the timestamp of the last received segment.  This allows the
   sender to discriminate between a SACK due to a delayed Seg4,TS10, or
   a SACK because of Seg4,TS30.  Therefore, the appropriate decision
   (retransmission of Seg1 once more, or addressing the observed
   reordering/delay accordingly [I-D.blanton-tcp-reordering]) can be
   taken with high confidence.

B.5.  Integrity of the Timestamp value

   If the timestamp is used for congestion control purposes, an
   incentive exists for malicious receivers to reflect tampered
   timestamps, as demonstrated with some exploits [CUBIC].

   One way to address this is to not use timestamp information directly,
   but to keep state in the sender for each sent segment, and track the
   round trip time independent of sent timestamps.  Such an approach has
   the drawback that it is not straightforward to make it work during
   loss recovery phases for those segments possibly lost (or reordered).
   In addition, there is processing and memory overhead to maintain
   possibly extensive lists in the sender that need to be consulted with
   each ACK.  Despite these drawbacks, this approach is currently
   implemented due to lack of alternatives (see [Linux] and [BSD10]).

   The preferred approach is that the sender MAY choose to protect
   timestamps from such modifications by including a fingerprint (secure
   hash of some kind) in some of the least significant bits.  However,
   doing so prevents a receiver from using the timestamp for other
   purposes, unless the receiver has prior knowledge about this use of
   some bits in the timestamp value.  Furthermore, strictly
   monotonically increasing values are still to be maintained.  That
   constraint restricts this approach somewhat and limits or inhibits
   the use of timestamp values for direct use by the receiver (e.g. for
   one-way delay variation measurement, as the hash bits would look like
   random noise in the delay measurement).

B.6.  Disambiguation with slow Timestamp clock

   In addition, but somewhat orthogonal to maintaining timestamp value
   integrity, there is a use case when the sender does not support a
   timestamp clock interval that can guarantee unique timestamps for
   retransmitted segments.
   This may happen whenever the TCP timestamp clock interval is higher
   than the round-trip time of the path.  To unambiguously distinguish
   regular from retransmitted segments, the timestamp must be unique
   for otherwise identical segments.  Reserving the least significant
   bits for this purpose allows senders with slow running timestamp
   clocks to make use of this feature.  However, without modifying the
   receiver behavior, only limited benefits can be extracted from such
   an approach.  Furthermore, the use of this option has implications
   for the protection against wrapped sequence numbers (PAWS,
   [RFC1323]): the more bits are set aside for tamper prevention, the
   faster the timestamp number space cycles.

   Using Timestamp capabilities to explicitly negotiate mask bits, and
   to set aside a (low) number of least significant bits for the
   purposes listed above, allows a sender to use more reliable
   integrity checks.  These masked bits are not to be considered part
   of the timestamp value for the purposes described in [RFC1323]
   (i.e., PAWS) and for subsequent heuristics using timestamp values
   (e.g., Eifel Detection), thereby lifting the strict requirement of
   always monotonically increasing timestamp values.  However, care
   should be taken to not mask too many bits, for the reasons outlined
   in [RFC1323].  Using a mask value higher than 8 is therefore
   discouraged.

   The reason for nevertheless having 5 bits for the mask field is to
   allow the implementation of this protocol in conjunction with TCP
   Cookie Transactions (TCPCT) extended timestamps [RFC6013].  That
   allows for nearly a quarter of a 128 bit timestamp to be set aside.

B.7.  Masked timestamps as segment digest

   After making TCP alternate checksums historic (see [RFC6247]), there
   still remains a need to address increased corruption probabilities
   when segment sizes are increased (see
   [I-D.ietf-tcpm-anumita-tcp-stronger-checksum]).

   Utilizing a completely masked TSval field allows the sender to
   include a stronger CRC32, with semantics independent of the fixed
   TCP header fields.  However, such a use would again exclude the use
   of PAWS on the receiver side, and a receiver would need to know the
   specifics of the digest for processing.  It is assumed that such a
   digest would only cover the data payload of a TCP segment.  In order
   to allow disambiguation of retransmissions, a special TSval can be
   defined (e.g., TSval=0) which bypasses regular CRC processing but
   allows the identification of retransmitted segments.

   The full semantics of such a data-only CRC scheme are beyond the
   scope of this document, but would require a different version of the
   timestamp capability.  Nevertheless, allowing the full TSval to
   remain unprocessed by the receiver for the purpose of PAWS even in
   version 0 could still allow the successful negotiation of
   sender-side enhancements such as loss recovery improvements (see
   Appendix B.3 and Appendix B.4).

   In effect, the masked portion of the timestamp value represents an
   unreliable out-of-band signal channel that could also be used for
   purposes other than solely performing timestamp integrity checks
   (for example, this would allow ER-SRTO algorithms [Cho08]).

Appendix C.  Revision history

   This appendix should be removed by the RFC Editor before publishing
   this document as an RFC.

   00 ... initial draft, early submission to meet deadline.

   01 ... refined draft, focusing only on those capabilities that have
   an immediate use case.
   Also excluding flags that can be substituted by other means (MIR -
   synergistic with SACK option only, RNG moved to appendix A, BIA
   removed and the exponent bias set to a fixed value).  Also extended
   other paragraphs.

   02 ... updated document after IETF80 - references to "timestamp
   options" were seen to be ambiguous with "timestamp option", and
   therefore replaced by "timestamp capabilities".  Also, the document
   was reworked to better align with RFC4101.  Removed SGN and
   increased FRAC to allow higher precision.

   03 ... removed references to "opaque" and "transparent".
   Substituted "timestamp clock interval" for all instances of rate.
   Changed signal encoding to resemble a scale/value approach like what
   is done with Window Scaling.  As an added benefit, clock quality can
   be implicitly signaled, since multiple representations can map to
   identical time intervals.  Added discussion around resilience
   against broken RFC1323 implementations (Win95, Linux 2.3.41+), which
   deviate from expected Timestamp signaling behavior.

   04 ... removed previous appendix A (range negotiation); minor edit
   to improve wording; moved Section 6 to the Appendix, and removed
   covert channels from the potential uses; added some text to discuss
   future versioning (compatible and incompatible variants); changed
   document structure; added guidance around PAWS; added pseudo-code
   examples (probably to be removed again)

Authors' Addresses

   Richard Scheffenegger
   NetApp, Inc.
   Am Euro Platz 2
   Vienna,   1120
   Austria

   Phone: +43 1 3676811 3146
   Email: rs@netapp.com

   Mirja Kuehlewind
   University of Stuttgart
   Pfaffenwaldring 47
   Stuttgart  70569
   Germany

   Email: mirja.kuehlewind@ikr.uni-stuttgart.de