| < draft-van-beijnum-multi-mtu-00.txt | draft-van-beijnum-multi-mtu-01.txt > | |||
|---|---|---|---|---|
| Network Working Group I. van Beijnum | Network Working Group I. van Beijnum | |||
| Internet-Draft Consultant | Internet-Draft Consultant | |||
| Expires: December 29, 2007 June 29, 2007 | Expires: February 29, 2008 August 29, 2007 | |||
| IPv6 Extensions for Multi-MTU Subnets | IPv6 Extensions for Multi-MTU Subnets | |||
| draft-van-beijnum-multi-mtu-00 | draft-van-beijnum-multi-mtu-01 | |||
| Status of this Memo | Status of this Memo | |||
| By submitting this Internet-Draft, each author represents that any | By submitting this Internet-Draft, each author represents that any | |||
| applicable patent or other IPR claims of which he or she is aware | applicable patent or other IPR claims of which he or she is aware | |||
| have been or will be disclosed, and any of which he or she becomes | have been or will be disclosed, and any of which he or she becomes | |||
| aware will be disclosed, in accordance with Section 6 of BCP 79. | aware will be disclosed, in accordance with Section 6 of BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| skipping to change at page 1, line 33 ¶ | skipping to change at page 1, line 33 ¶ | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
| http://www.ietf.org/ietf/1id-abstracts.txt. | http://www.ietf.org/ietf/1id-abstracts.txt. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| This Internet-Draft will expire on December 29, 2007. | This Internet-Draft will expire on February 28, 2008. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The IETF Trust (2007). | Copyright (C) The IETF Trust (2007). | |||
| Abstract | Abstract | |||
| In the early days of the internet, many different link types with many | In the early days of the internet, many different link types with many | |||
| different maximum packet sizes were in use. For point-to-point or | different maximum packet sizes were in use. For point-to-point or | |||
| point-to-multipoint links, there are still some other link types (PPP, | point-to-multipoint links, there are still some other link types (PPP, | |||
| ATM, Packet over SONET), but shared subnets are almost exclusively | ATM, Packet over SONET), but shared subnets are almost exclusively | |||
| implemented as ethernets. Even though the relevant standards mandate a | implemented as ethernets. Even though the relevant standards mandate a | |||
| 1500 byte maximum packet size for ethernet, more and more ethernet | 1500 octet maximum packet size for ethernet, more and more ethernet | |||
| equipment is capable of handling packets bigger than 1500 bytes. | equipment is capable of handling packets bigger than 1500 octets. | |||
| However, since this capability isn't standardized, it's seldom used | However, since this capability isn't standardized, it's seldom used | |||
| today, despite the potential performance benefits of using larger | today, despite the potential performance benefits of using larger | |||
| packets. This document specifies a mechanism to negotiate per-neighbor | packets. This document specifies a mechanism to negotiate per-neighbor | |||
| maximum packet sizes so that nodes on a shared subnet may use the | maximum packet sizes so that nodes on a shared subnet may use the | |||
| maximum mutually supported packet size between them without being | maximum mutually supported packet size between them without being | |||
| limited by nodes with smaller maximum sizes on the same subnet. | limited by nodes with smaller maximum sizes on the same subnet. | |||
| 1 Introduction | 1 Introduction | |||
| Some protocols inherently generate small packets. Examples are VoIP, | Some protocols inherently generate small packets. Examples are VoIP, | |||
| where it's necessary to send packets frequently before much data can | where it's necessary to send packets frequently before much data can | |||
| be gathered to fill up the packet, and the DNS, where the queries are | be gathered to fill up the packet, and the DNS, where the queries are | |||
| inherently small and the returned results also rarely fill up a full | inherently small and the returned results also rarely fill up a full | |||
| 1500-byte packet. However, most data that is transferred across the | 1500-octet packet. However, most data that is transferred across the | |||
| internet and private networks is at least several kilobytes in size | internet and private networks is at least several kilobytes in size | |||
| (often much larger) and requires segmentation by TCP or another | (often much larger) and requires segmentation by TCP or another | |||
| transport protocol. These types of data transfer can benefit from | transport protocol. These types of data transfer can benefit from | |||
| larger packets in several ways: | larger packets in several ways: | |||
| 1. A higher data-to-header ratio makes for fewer overhead bytes | 1. A higher data-to-header ratio makes for fewer overhead bytes | |||
| 2. Fewer packets means fewer per-packet operations on the source and | 2. Fewer packets means fewer per-packet operations on the source and | |||
| destination hosts | destination hosts | |||
| 3. Fewer packets also means fewer per-packet operations in routers and | 3. Fewer packets also means fewer per-packet operations in routers and | |||
| middleboxes | middleboxes | |||
| 4. TCP performance tends to increase with larger packet sizes | 4. TCP performance tends to increase with larger packet sizes | |||
| Even though today, the capability to use larger packets (often called | Even though today, the capability to use larger packets (often called | |||
| jumboframes) is present in a lot of ethernet hardware, this capability | jumbo frames) is present in a lot of ethernet hardware, this | |||
| isn't used because IP assumes a common MTU size for all nodes | capability isn't used because IP assumes a common MTU size for all | |||
| connected to a link or subnet. In practice, this means that using a | nodes connected to a link or subnet. In practice, this means that | |||
| larger MTU requires manual configuration of the non-standard MTU | using a larger MTU requires manual configuration of the | |||
| size on all hosts and routers and possibly on switches. Also, the MTU | non-standard MTU size on all hosts and routers and possibly on | |||
| size for a subnet is limited to that of the least capable router, host | switches. Also, the MTU size for a subnet is limited to that of | |||
| or switch. | the least capable router, host or switch. | |||
| This document proposes to end this situation using several new IPv6 | This document proposes to end this situation using several new | |||
| options and messages: | options and messages: | |||
| 1. An additional router advertisement MTU option to limit higher | 1. An additional router advertisement MTU option to limit higher | |||
| maximum packet sizes | maximum packet sizes | |||
| 2. A new switch advertisement message, similar to a router | 2. A neighbor discovery option that allows nodes to inform their | |||
| advertisement message, so that switches can announce the maximum | ||||
| packet size they support | ||||
| 3. A neighbor discovery option that allows nodes to inform their | ||||
| neighbors of the maximum packet size they support | neighbors of the maximum packet size they support | |||
| 4. A new ICMPv6 message for confirming that packets with an increased | ||||
| maximum size can be transmitted and received successfully | ||||
| Nodes running IPv6 may take advantage of these mechanisms to send | 3. A neighbor discovery option for padding messages to make them | |||
| packets larger than the standard maximum size. Since IPv4 doesn't | suitable for probing a neighbor's MTU and link-layer MTU | |||
| support equivalent mechanisms, support for IPv4 requires additional | limitations | |||
| work that is best carried out after deployment experience with IPv6. | 4. Padding for ARP messages to make them suitable for probing a | |||
| neighbor's MTU and link-layer MTU limitations | ||||
| 2 Terminology | 2 Terminology | |||
| Local MTU: | ||||
| The maximum packet size considered usable on an interface, | ||||
| based on the physical MTU, the MTU advertised by routers and | ||||
| administrative settings. | ||||
| MTU: | MTU: | |||
| Maximum Transmission Unit. This is the maximum IP packet size in | Maximum Transmission Unit. This is the maximum IP packet size in | |||
| bytes supported on a link, towards a neighbor or towards a remote | octets supported on a link, towards a neighbor or towards a remote | |||
| correspondent. In some cases, the term MRU (maximum receive unit) | correspondent. In some cases, the term MRU (maximum receive unit) | |||
| would be more appropriate, but for consistency, the term MTU is | would be more appropriate, but for consistency, the term MTU is | |||
| used throughout this document. | used throughout this document. | |||
| Advised MTU: | ||||
| The MTU that is considered the best or safe choice at a given time | ||||
| on a given link. | ||||
| Allowed MTU: | ||||
| The maximum MTU allowed administratively. | ||||
| Local MTU: | ||||
| The maximum packet size considered usable on a node, based on the | ||||
| physical MTU, the allowed MTU and advised MTUs. | ||||
| Neighbor MTU: | Neighbor MTU: | |||
| The maximum packet size that may be used towards a given on-link | The maximum packet size that may be used towards a given | |||
| neighbor. | on-link neighbor. | |||
| Off-link MTU: | Node: | |||
| The maximum packet size that is appropriate for communicating with | A host or router running IPv4 or IPv6. | |||
| off-link correspondents. | ||||
| Oversized packet: | ||||
| A packet exceeding the size defined in the relevant | ||||
| IPv6-over-... or IP-over-... RFC. | ||||
| Physical MTU: | Physical MTU: | |||
| The MTU reported by the driver for an interface when operating at | The MTU reported by the driver for an interface when operating at | |||
| a given link speed. | a given link speed. | |||
| Tentative neighbor MTU: | Probe: | |||
| The maximum packet size advertised by a neighbor. | An ARP or neighbor solicitation packet of a specific (oversized) | |||
| size sent for the purpose of determining whether a neighbor can | ||||
| successfully receive packets of this size sent by the local node. | ||||
| 3 Disadvantages of larger packets | 3 Disadvantages of larger packets | |||
| Although often desirable, the use of larger packets isn't universally | Although often desirable, the use of larger packets isn't universally | |||
| advantageous for the following reasons: | advantageous for the following reasons: | |||
| 1. Increased delay and jitter | 1. Increased delay and jitter | |||
| 2. Increased reliance on path MTU discovery | 2. Increased reliance on path MTU discovery | |||
| 3. Increased packet loss through bit errors | 3. Increased packet loss through bit errors | |||
| 4. Increased risk of undetected bit errors | 4. Increased risk of undetected bit errors | |||
| 3.1 Delay and jitter | 3.1 Delay and jitter | |||
| On low-bandwidth links, the additional time it takes to transmit | On low-bandwidth links, the additional time it takes to transmit | |||
| larger packets may lead to unacceptable delays. For instance, | larger packets may lead to unacceptable delays. For instance, | |||
| transmitting a 9000-byte packet takes 7.23 milliseconds at 10 Mbps, | transmitting a 9000-octet packet takes 7.23 milliseconds at 10 Mbps, | |||
| while transmitting a 1500-byte packet takes only 1.23 ms. Once | while transmitting a 1500-octet packet takes only 1.23 ms. Once | |||
| transmission of a packet has started, additional traffic must wait for | transmission of a packet has started, additional traffic must wait for | |||
| the transmission to finish, so a larger maximum packet size | the transmission to finish, so a larger maximum packet size | |||
| immediately leads to a higher worst-case head-of-line blocking delay, | immediately leads to a higher worst-case head-of-line blocking delay, | |||
| and as such, to a bigger difference between the best and worst cases | and as such, to a bigger difference between the best and worst cases | |||
| (jitter). The increase in average delay depends on the number of | (jitter). The increase in average delay depends on the number of | |||
| packets that are buffered, the average packet size and the queuing | packets that are buffered, the average packet size and the queuing | |||
| strategy in use. Buffer sizes vary greatly, but assuming 40 buffers | strategy in use. Buffer sizes vary greatly, but assuming 40 buffers | |||
| (not uncommon) leads to the following results: | (not uncommon) leads to the following results: | |||
| Speed 500 1500 4500 9000 16384 65535 | Speed 500 1500 4500 9000 16384 65535 | |||
| 10 Mbps 17.22 49.22 145.22 289.22 525.50 2098.34 | 10 Mbps 17.22 49.22 145.22 289.22 525.50 2098.34 | |||
| 100 Mbps 1.72 4.92 14.52 28.92 52.55 209.83 | 100 Mbps 1.72 4.92 14.52 28.92 52.55 209.83 | |||
| 1 Gbps 0.17 0.49 1.45 2.89 5.26 20.98 | 1 Gbps 0.17 0.49 1.45 2.89 5.26 20.98 | |||
| 10 Gbps 0.02 0.05 0.15 0.29 0.53 2.10 | 10 Gbps 0.02 0.05 0.15 0.29 0.53 2.10 | |||
| In milliseconds and counting 38 additional bytes of ethernet overhead. | In milliseconds and counting 38 additional octets of ethernet | |||
| overhead. | ||||
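The table values can be reproduced with a short calculation. The following is a minimal sketch, assuming the model implied by the text: 40 queued packets of the given size, each carrying 38 additional octets of ethernet overhead, drained at the raw link speed. The function name and constants are illustrative and not part of the draft.

```python
# Illustrative sketch: worst-case queuing delay for a full buffer of packets.
ETHERNET_OVERHEAD_OCTETS = 38  # per-packet overhead assumed by the table


def queue_delay_ms(packet_octets, link_bps, buffers=40):
    """Time in milliseconds to transmit `buffers` packets of `packet_octets`."""
    total_bits = (packet_octets + ETHERNET_OVERHEAD_OCTETS) * 8 * buffers
    return total_bits / link_bps * 1000.0


if __name__ == "__main__":
    sizes = [500, 1500, 4500, 9000, 16384, 65535]
    speeds = [("10 Mbps", 10e6), ("100 Mbps", 100e6),
              ("1 Gbps", 1e9), ("10 Gbps", 10e9)]
    for name, bps in speeds:
        row = " ".join(f"{queue_delay_ms(size, bps):9.2f}" for size in sizes)
        print(f"{name:>8} {row}")
```

Calling the function with buffers=1 gives the single-packet transmission time used earlier in this section, for example roughly 7.23 ms for a 9000-octet packet at 10 Mbps.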
| If we assume that the delays involved with 1500-byte packets on 100 | If we assume that the delays involved with 1500-octet packets on 100 | |||
| Mbps ethernet are acceptable for most, if not all, applications, then | Mbps ethernet are acceptable for most, if not all, applications, then | |||
| the conclusion must be that 9000-byte packets on 1 Gbps ethernet | the conclusion must be that 9000-octet packets on 1 Gbps ethernet | |||
| should also be acceptable. At 10 Gbps ethernet, much larger packet | should also be acceptable. At 10 Gbps ethernet, much larger packet | |||
| sizes could be accommodated without adverse impact on delay-sensitive | sizes could be accommodated without adverse impact on delay-sensitive | |||
| applications. Below 100 Mbps, larger packet sizes are probably not | applications. Below 100 Mbps, larger packet sizes are probably not | |||
| advisable. | advisable. | |||
| 3.2 Path MTU Discovery problems | 3.2 Path MTU Discovery problems | |||
| PMTUD issues arise when routers can't fragment packets in transit | PMTUD issues arise when routers can't fragment packets in transit | |||
| because the DF bit is set or because the packet is IPv6, but the | because the DF bit is set or because the packet is IPv6, but the | |||
| packet is too large to be forwarded over the next link, and the | packet is too large to be forwarded over the next link, and the | |||
| skipping to change at page 5, line 11 ¶ | skipping to change at page 5, line 8 ¶ | |||
| size) option makes sure that TCP packets conform to the limited MTU. | size) option makes sure that TCP packets conform to the limited MTU. | |||
| PMTUD problems are of course possible with non-TCP protocols, but this | PMTUD problems are of course possible with non-TCP protocols, but this | |||
| is rare in practice. | is rare in practice. | |||
| Taking the delay and jitter issues to heart, maximum packet sizes | Taking the delay and jitter issues to heart, maximum packet sizes | |||
| should be larger for faster links. This means that in the majority of | should be larger for faster links. This means that in the majority of | |||
| cases, the MTU bottleneck will tend to be at one of the ends of a | cases, the MTU bottleneck will tend to be at one of the ends of a | |||
| path, rather than somewhere in the middle. | path, rather than somewhere in the middle. | |||
| A crucial difference between PMTUD problems that result from MTUs | A crucial difference between PMTUD problems that result from MTUs | |||
| smaller than the standard 1500 bytes and PMTUD problems that result | smaller than the standard 1500 octets and PMTUD problems that result | |||
| from MTUs larger than the standard 1500 bytes is that in the latter | from MTUs larger than the standard 1500 octets is that in the latter | |||
| case, only a party that's actually using the non-standard MTU is | case, only a party that's actually using the non-standard MTU is | |||
| affected. This puts potential problems and potential benefits in the | affected. This puts potential problems and potential benefits in the | |||
| same place so it's always possible to revert to a 1500-byte MTU if | same place so it's always possible to revert to a 1500-octet MTU if | |||
| PMTUD problems can't be resolved otherwise. | PMTUD problems can't be resolved otherwise. | |||
| Given the above and the work that's going on in the IETF to | Given the above and the work that's going on in the IETF to | |||
| resolve PMTUD issues as they exist today, increasing MTUs | resolve PMTUD issues as they exist today, increasing MTUs | |||
| where desired doesn't involve undue risks. | where desired doesn't involve undue risks. | |||
| 3.3 Packet loss through bit errors | 3.3 Packet loss through bit errors | |||
| All transmission media are subject to bit errors. In many cases, a bit | All transmission media are subject to bit errors. In many cases, a bit | |||
| error leads to a CRC failure, after which the packet is lost. In other | error leads to a CRC failure, after which the packet is lost. In other | |||
| skipping to change at page 5, line 39 ¶ | skipping to change at page 5, line 36 ¶ | |||
| packet being lost due to errors increases. And when a packet is lost, | packet being lost due to errors increases. And when a packet is lost, | |||
| more data has to be retransmitted. | more data has to be retransmitted. | |||
| Both per-packet overhead and loss through errors reduce the amount of | Both per-packet overhead and loss through errors reduce the amount of | |||
| usable data transferred. The optimum tradeoff is reached when both | usable data transferred. The optimum tradeoff is reached when both | |||
| types of loss are equal. If we make the simplifying assumption that | types of loss are equal. If we make the simplifying assumption that | |||
| the relationship between the bit error rate of a medium and the | the relationship between the bit error rate of a medium and the | |||
| resulting number of lost packets is linear with packet size, the | resulting number of lost packets is linear with packet size, the | |||
| optimum packet size is computed as follows: | optimum packet size is computed as follows: | |||
| packet size = sqrt(overhead bytes / bit error rate) | packet size = sqrt(overhead octets / bit error rate) | |||
| For IPv6 in ethernet framing, with 14 bytes of ethernet header, 40 | For IPv6 in ethernet framing, with 14 octets of ethernet header, 40 | |||
| bytes of IPv6 header, 20 bytes of TCP header and 32 bits of ethernet | octets of IPv6 header, 20 octets of TCP header and 32 bits of ethernet | |||
| CRC the total number of bytes transmitted is 1538 while the useful | CRC the total number of octets transmitted is 1538 while the useful | |||
| data is 1440. (The preamble and inter frame gap are not relevant for | data is 1440. (The preamble and inter frame gap are not relevant for | |||
| error rate purposes.) 78 bytes of overhead would result in a 1518-byte | error rate purposes.) 78 octets of overhead would result in a | |||
| frame length for a bit error rate of 10^-5.3. | 1518-octet frame length for a bit error rate of 10^-5.3. | |||
| Note that the minimum BER for 1000BASE-T is 10^-10, which implies an | Note that the minimum BER for 1000BASE-T is 10^-10, which implies an | |||
| optimum packet size of 312250 bytes. | optimum packet size of 312250 octets. | |||
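As a rough check of the formula, the sketch below evaluates it for a given overhead and bit error rate. It assumes that the overhead and the result are both expressed in bits before converting back to octets, since that interpretation reproduces the 312250-octet figure quoted for a BER of 10^-10. The function name is illustrative.

```python
import math


def optimum_packet_octets(overhead_octets, bit_error_rate):
    """Packet size (in octets) at which header overhead and error loss balance.

    Assumption: the square-root formula is applied to sizes in bits.
    """
    overhead_bits = overhead_octets * 8
    optimum_bits = math.sqrt(overhead_bits / bit_error_rate)
    return optimum_bits / 8


if __name__ == "__main__":
    # 78 octets of ethernet + IPv6 + TCP + CRC overhead, BER of 10^-10
    print(round(optimum_packet_octets(78, 1e-10)))
```

Because the result scales with the inverse square root of the bit error rate, a hundredfold improvement in BER raises the optimum packet size only tenfold.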
| In practice, it's better to err on the side of smaller packets and | In practice, it's better to err on the side of smaller packets and | |||
| lower packet loss to avoid triggering TCP congestion mechanisms. | lower packet loss to avoid triggering TCP congestion mechanisms. | |||
| However, it's obvious that current maximum packet sizes are far below | However, it's obvious that current maximum packet sizes are far below | |||
| the optimum size with respect to optimum throughput. | the optimum size with respect to optimum throughput. | |||
| 3.4 Undetected bit errors | 3.4 Undetected bit errors | |||
| Nearly all link layers employ some kind of checksum to detect bit | Nearly all link layers employ some kind of checksum to detect bit | |||
| errors so that packets with errors can be discarded. In the case of | errors so that packets with errors can be discarded. In the case of | |||
| ethernet, this is a frame check sequence in the form of a 32-bit CRC. | ethernet, this is a frame check sequence in the form of a 32-bit CRC. | |||
| The error detecting properties of the CRC are twofold: the minimum | The error detecting properties of the CRC are twofold: the minimum | |||
| Hamming distance and the statistical unlikeliness of two packets | Hamming distance and the statistical unlikeliness of two packets | |||
| resulting in the same CRC. Depending on the size of the packet, there | resulting in the same CRC. Depending on the size of the packet, there | |||
| is a minimum Hamming distance between two possible packets that result | is a minimum Hamming distance between two possible packets that result | |||
| in the same CRC. For ethernet packets between 376 and 11454 bytes long | in the same CRC. For ethernet packets between 376 and 11454 octets | |||
| (inclusive), the Hamming distance is 3 [CRC]. So all packets where | long (inclusive), the Hamming distance is 3 [CRC]. So all packets | |||
| transmission errors resulted in one or two flipped bits are detected. | where transmission errors resulted in one or two flipped bits are | |||
| If 3 or more bits are flipped, most errors are caught because only in | detected. If 3 or more bits are flipped, most errors are caught | |||
| very few cases, the new bit pattern results in the same CRC as the old | because only in very few cases, the new bit pattern results in the | |||
| bit pattern. In theory, the chance of two packets having the same | same CRC as the old bit pattern. In theory, the chance of two | |||
| CRC-32 is 1 in 2^32, but this assumes the CRC is as strong as it | packets having the same CRC-32 is 1 in 2^32, but this assumes the | |||
| possibly could be. | CRC is as strong as it possibly could be. | |||
| It has been suggested that increasing packet lengths reduce the | It has been suggested that increasing packet lengths reduce the | |||
| effectiveness of the CRC-32. For the statistical aspect of the CRC, | effectiveness of the CRC-32. For the statistical aspect of the CRC, | |||
| this isn't true. Again, assuming a linear relationship between the | this isn't true. Again, assuming a linear relationship between the | |||
| likelihood of bit errors in a packet and the bit error rate, doubling | likelihood of bit errors in a packet and the bit error rate, doubling | |||
| the packet size means doubling the chance of a given number of bit | the packet size means doubling the chance of a given number of bit | |||
| errors in the packet. In turn, this doubles the chance of a packet | errors in the packet. In turn, this doubles the chance of a packet | |||
| with bit errors going undetected by the CRC. However, because the | with bit errors going undetected by the CRC. However, because the | |||
| packet is twice as long, only half the number of packets is required | packet is twice as long, only half the number of packets is required | |||
| to transmit any given amount of data. These aspects cancel each other | to transmit any given amount of data. These aspects cancel each other | |||
| skipping to change at page 7, line 4 ¶ | skipping to change at page 6, line 50 ¶ | |||
| having enough bit errors to satisfy a given Hamming distance (packet | having enough bit errors to satisfy a given Hamming distance (packet | |||
| error rate) and then generate the same CRC is: | error rate) and then generate the same CRC is: | |||
| PER = (packet length in bits * BER) ^ H / 2^32 | PER = (packet length in bits * BER) ^ H / 2^32 | |||
| The likelihood of a packet with enough bit errors to meet the Hamming | The likelihood of a packet with enough bit errors to meet the Hamming | |||
| distance and then generate an identical CRC in a transmission of a | distance and then generate an identical CRC in a transmission of a | |||
| certain number of bits is: | certain number of bits is: | |||
| TER = transmission length / packet length * PER | TER = transmission length / packet length * PER | |||
| In other words: | In other words: | |||
| TER = transmission length * packet length ^ (H - 1) * BER ^ H / 2^32 | TER = transmission length * packet length ^ (H - 1) * BER ^ H / 2^32 | |||
| (Hence the irrelevance of the packet length for a Hamming distance of | (Hence the irrelevance of the packet length for a Hamming distance of | |||
| 1.) | 1.) | |||
| For a 400 GB (approximately one hour) transmission over 1000BASE-T | For a 400 GB (approximately one hour) transmission over 1000BASE-T | |||
| with a BER of 10^-10 and a 1518-byte ethernet frame length this means: | with a BER of 10^-10 and a 1518-octet ethernet frame length this | |||
| means: | ||||
| TER = 3.44*10^12 * 12144 ^ 2 * 10^-10 ^ 3 / 2^32 = 1.18*10^-19 | TER = 3.44*10^12 * 12144 ^ 2 * 10^-10 ^ 3 / 2^32 = 1.18*10^-19 | |||
| For 11454-byte packets this becomes: | For 11454-octet packets this becomes: | |||
| TER = 3.44*10^12 * 91632 ^ 2 * 10^-10 ^ 3 / 2^32 = 6.73*10^-18 | TER = 3.44*10^12 * 91632 ^ 2 * 10^-10 ^ 3 / 2^32 = 6.73*10^-18 | |||
| Please note that this is about 12 orders of magnitude better than what the naive | Please note that this is about 12 orders of magnitude better than what the naive | |||
| assumption of a Hamming distance of 1 suggests for standard 1518-byte | assumption of a Hamming distance of 1 suggests for standard 1518-octet | |||
| ethernet frames: | ethernet frames: | |||
| TER = 3.44*10^12 * 12144 ^ 0 * 10^-10 ^ 1 / 2^32 = 8.01*10^-8 | TER = 3.44*10^12 * 12144 ^ 0 * 10^-10 ^ 1 / 2^32 = 8.01*10^-8 | |||
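The PER and TER definitions given earlier in this section can be evaluated directly. The sketch below is an illustration under the same simplifying assumptions as the text (independent bit errors, an ideal 2^32 collision probability). The function and parameter names are not from the draft.

```python
def packet_error_rate(packet_bits, ber, hamming_distance):
    """PER = (packet length in bits * BER) ^ H / 2^32."""
    return (packet_bits * ber) ** hamming_distance / 2 ** 32


def transmission_error_rate(transmission_bits, packet_bits, ber, hamming_distance):
    """TER = transmission length / packet length * PER."""
    packets = transmission_bits / packet_bits
    return packets * packet_error_rate(packet_bits, ber, hamming_distance)


if __name__ == "__main__":
    transmission_bits = 3.44e12  # about 400 GB, as in the text
    for frame_octets in (1518, 11454):
        ter = transmission_error_rate(transmission_bits, frame_octets * 8, 1e-10, 3)
        print(f"{frame_octets}-octet frames: TER = {ter:.3g}")
```

With a Hamming distance of 3, the two printed values differ by the square of the ratio between the frame lengths, which is the scaling described in the surrounding text.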
| So the strength of the CRC, assuming a Hamming distance of 3, goes | So the strength of the CRC, assuming a Hamming distance of 3, goes | |||
| down with the square of the factor by which the packet length is | down with the square of the factor by which the packet length is | |||
| increased. And it goes down with the third power of any increase of | increased. And it goes down with the third power of any increase of | |||
| the bit error rate. However, this discussion is largely academic | the bit error rate. However, this discussion is largely academic | |||
| because of the assumption that bit errors happen in isolation. For | because of the assumption that bit errors happen in isolation. For | |||
| instance, 1000BASE-T transmits two bits per symbol over four wire | instance, 1000BASE-T transmits two bits per symbol over four wire | |||
| skipping to change at page 8, line 4 ¶ | skipping to change at page 7, line 49 ¶ | |||
| Larger packets aren't universally desirable. The factors that weigh | Larger packets aren't universally desirable. The factors that weigh | |||
| into the decision to use larger packets include: | into the decision to use larger packets include: | |||
| - A link's bit error rate | - A link's bit error rate | |||
| - The number of bits per symbol on a link and hence the likelihood of | - The number of bits per symbol on a link and hence the likelihood of | |||
| multiple bit errors in a single packet | multiple bit errors in a single packet | |||
| - The strength of the Frame Check Sequence | - The strength of the Frame Check Sequence | |||
| - The link speed | - The link speed | |||
| - The number of buffers | - The number of buffers | |||
| - Queuing strategy | - Queuing strategy | |||
| This means that choosing a good maximum packet size is, initially at | This means that choosing a good maximum packet size is, initially at | |||
| least, the responsibility of hardware vendors. On top of that, robust | least, the responsibility of hardware vendors. On top of that, robust | |||
| mechanisms must be available to operators to further limit maximum | mechanisms must be available to operators to further limit maximum | |||
| packet sizes where appropriate. | packet sizes where appropriate. | |||
| 4 The protocol mechanisms | 4 The protocol mechanisms | |||
| The basic idea is that nodes are free to negotiate larger MTUs with | The basic idea is that nodes are free to negotiate larger MTUs with | |||
| neighbors. However, to avoid problems, test packets are sent first | neighbors on a subnet. However, to avoid problems, probe packets | |||
| before larger packets are used for actual traffic, and routers and | are sent first before larger packets are used for actual traffic, | |||
| switches may inform nodes of MTU limitations that are best observed | and routers may inform hosts of MTU limitations that should be | |||
| or are mandatory to observe. | observed for three common ranges of link speeds. The rationale for | |||
| having different MTU limitations for different link speeds is that | ||||
| it's common for devices operating at the link layer to support | ||||
| larger MTUs if they support and/or operate at higher link speeds. | ||||
| E.g., a LAN could consist of a gigabit ethernet switch with jumbo | ||||
| frame capabilities connected to a 10/100 Mbps ethernet switch which | ||||
| doesn't support jumbo frames. By limiting the use of oversized | ||||
| packets to nodes operating at 1000 Mbps, the 10/100 Mbps switch | ||||
| isn't exposed to oversized packets which would result in error | ||||
| conditions and use up unnecessary bandwidth. Additionally, it may | ||||
| be desirable, for QoS purposes, to limit packet sizes at lower | ||||
| speeds even if a large MTU is supported. | ||||
| 4.1 The variable MTU router advertisement option | Additionally, routers send out two flags. One is intended to signal | |||
| hosts to be conservative in the number of probes they transmit to | ||||
| avoid triggering undesired behavior by link-layer devices seeing a | ||||
| large number of out-of-spec packets. The other flag suppresses | ||||
| probing for compatibility with the existing practice where all | ||||
| nodes on a subnet are administratively configured with a | ||||
| non-standard MTU. | ||||
| Probing consists of sending a large neighbor discovery or ARP | ||||
| packet to a neighbor. If the neighbor sends a reply, it managed to | ||||
| successfully receive the probe so the per-neighbor MTU for this | ||||
| neighbor can be set to the size of the probe packet and data | ||||
| packets of that size can now be sent. | ||||
| 4.1 The multi-MTU router advertisement option | ||||
| Routers use this option to inform hosts on connected subnets about the | Routers use this option to inform hosts on connected subnets about the | |||
| maximum allowed MTU for a given link speed and the off-link MTU that | maximum allowed MTU for three ranges of link speeds. | |||
| should be used towards off-link destinations. | ||||
| 1 2 3 | 1 2 3 | |||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Type | Length | Reserved | | | Type | Length |C|N| Reserved | Pri | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Off-link MTU | | | MAXMTU1000 | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Reserved | Pri | Link speed | | | MAXMTU100 | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Allowed MTU | | | MAXMTU10 | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| Type: TBD | Type: TBD | |||
| Length: 2 | Length: | |||
| 1 or 2. A length of more than 2 indicates a future extension with | ||||
| additional fields and MUST NOT be treated as an error; the | ||||
| additional fields MUST be ignored. | ||||
| Reserved: 0 on transmission, ignored on reception. | C: | |||
| "Conservative" flag: when set, nodes should reduce the number of | ||||
| large packets sent by using conservative timing and probing | ||||
| algorithms, if possible avoiding sending more than one | ||||
| unsuccessful probe per 60 seconds. When the flag is cleared, | ||||
| nodes may send several oversized packets per second when | ||||
| probing. | ||||
| Off-link MTU: | N: | |||
| This is the maximum packet size that a router can forward to other | "No probe" flag: when set to 0, hosts MUST probe before using | |||
| links it connects to. Hosts SHOULD use a TCP MSS option based on | oversized packets towards a neighbor. When set to 1, hosts MUST | |||
| this value in all TCP sessions and limit packets sent to off-link | NOT send probes and use the relevant MAXMTU field as their MTU. | |||
| destinations to this maximum. The off-link MTU must be at least | If MAXMTU is larger than the physical MTU, an error is logged. | |||
| 1280. A value of 0 means the off-link MTU is undefined and hosts | ||||
| should use their physical MTU in TCP MSS options and limit packets | Reserved: 0 on transmission, ignored on reception. | |||
| sent to routers to the maximum MTU the router supports as | ||||
| discovered through the neighbor discovery option. | ||||
| Pri: | Pri: | |||
| Priority. Values have the following meaning: | Priority. Values have the following meaning: | |||
| 000: Vendor default | 000: Vendor default | |||
| 001: Local override of 000 | 001: Local override of 000 | |||
| 010: Site default | 010: Site default | |||
| 011: Local override of 010 | 011: Local override of 010 | |||
| 100: Subnet default | 100: Subnet default | |||
| 101: Local override of 100 | 101: Local override of 100 | |||
| skipping to change at page 9, line 19 ¶ | skipping to change at page 10, line 4 ¶ | |||
| 001: Local override of 000 | 001: Local override of 000 | |||
| 010: Site default | 010: Site default | |||
| 011: Local override of 010 | 011: Local override of 010 | |||
| 100: Subnet default | 100: Subnet default | |||
| 101: Local override of 100 | 101: Local override of 100 | |||
| 110: Per-node setting | 110: Per-node setting | |||
| 111: Local override of 110 | 111: Local override of 110 | |||
| Vendors may only use priority 000 in default configurations. | Vendors may only use priority 000 in default configurations. | |||
| Site-wide administrative settings may only use 000 and 010. | Site-wide administrative settings may only use 000 and 010. | |||
| Subnet-specific administrative settings may use 000, 010 or 110, | Subnet-specific administrative settings may use 000, 010 or 110, | |||
| but not 001, 011, 101 or 111. | but not 001, 011, 101 or 111. | |||
| Link speed: | MAXMTU1000: | |||
| Minimum link speed the option may apply to. Values from 0 to 49151 | The maximum packet size allowed on a link operating at a speed | |||
| indicate a link speed in megabits per second. Values from 49152 to | of 300 Mbps or more. Packets larger than this value SHOULD NOT | |||
| 65535 are reserved for future use, but imply a link speed of more | be sent over the link in question. The MAXMTU1000 MUST be at | |||
| than 49151 Mbps. Hosts MUST ignore all options with a link speed | least the MTU size specified in the relevant IPv6-over-... RFC. | |||
| value that's higher than the current link speed of the interface | A value of 0 means that the MTU size is undefined and no | |||
| the option is received over. For instance, if a host has an | maximum size is enforced for this link speed. | |||
| interface that supports 10, 100 and 1000 Mbps ethernet which | ||||
| currently operates at 100 Mbps, and the host receives options | ||||
| with link speed values of 100 and 1000 over that interface, the | ||||
| option with the link speed of 100 is processed and the option | ||||
| with the link speed of 1000 is ignored. | ||||
| Allowed MTU: | MAXMTU100: | |||
| The maximum packet size allowed on a link. Packets larger than | The maximum packet size allowed on a link operating at a speed | |||
| this value MUST NOT be sent over the link in question. The allowed | of 30 to 299 Mbps and links operating at an unknown speed if | |||
| MTU MUST be at least 1500. A value of 0 means that the allowed | that speed can be 30 Mbps or higher. Packets larger than | |||
| MTU is undefined and no maximum MTU is enforced. | this value SHOULD NOT be sent over the link in question. The | |||
| MAXMTU100 MUST be at least the MTU size specified in the | ||||
| relevant IPv6-over-... RFC. A value of 0 means that the MTU | ||||
| size is undefined and no maximum size is enforced for this link | ||||
| speed. | ||||
| The number of variable MTU options in router advertisements is limited | MAXMTU10: | |||
| to a maximum of 4. | The maximum packet size allowed on a link operating at a speed | |||
| of less than 30 Mbps. Packets larger than this value SHOULD NOT | ||||
| be sent over the link in question. The MAXMTU10 MUST be at | ||||
| least the MTU size specified in the relevant IPv6-over-... RFC. | ||||
| A value of 0 means that the MTU size is undefined and no | ||||
| maximum size is enforced for this link speed. | ||||
| Hosts are expected to recover the variable MTU options from the router | When MAXMTU1000, MAXMTU100 and MAXMTU10 all contain the same value, | |||
| it is allowed to omit MAXMTU100 and MAXMTU10 so the option has a | ||||
| length of 1 (8 octets) rather than 2 (16 octets). The receiver of | ||||
| the option should treat the shorter option the same as a full-length | ||||
| option where the three MAXMTU fields all contain the value from | ||||
| MAXMTU1000. | ||||
| Hosts are expected to recover the multi-MTU options from the router | ||||
| advertisements of at least the router they select as a default router, | advertisements of at least the router they select as a default router, | |||
| but it's allowed (not required) to recover options from multiple | but it's encouraged (not required) to recover options from multiple | |||
| routers. The same option, or data constituting the same information, | routers. The same option, or data constituting the same information, | |||
| may be learned from other sources, such as local configuration and/or | may be learned from other sources, such as local configuration and/or | |||
| DHCPv6. Hosts MUST only consider variable MTU options where the value | DHCPv6. Hosts SHOULD use the MAXMTU value relevant for the link | |||
| of the link speed field doesn't exceed that of the current link speed | speed the interface is currently operating at from the option or | |||
| of the associated interface. Any options (or equivalents) that satisfy | equivalent information with the largest priority value. If the | |||
| this condition are ordered by the priority, link speed and allowed MTU | relevant MAXMTU field is unspecified (zero) in the option or | |||
| fields, in that order. Hosts SHOULD copy the allowed MTU and off-link | information with the highest priority, the field from the option | |||
| MTU information, if specified, from the option (or equivalent) with | or information with the next highest priority is considered, and | |||
| the largest value for the concatenation of these three fields. | so on. If no information is available because no option or | |||
| equivalent is available, or the relevant MAXMTU field never has a | ||||
| 4.2 Changes to the RA MTU option semantics | non-zero value, the host SHOULD use its physical MTU as the | |||
| MAXMTU. | ||||
| Hosts are currently supposed to ignore an MTU of more than 1500 in the | ||||
| MTU option in router advertisements on ethernet links [RFC2464]. This | ||||
| makes it impossible to use an MTU larger than 1500 bytes for multicast | ||||
| packets. In order to lift this limitation, routers and hosts that | ||||
| implement variable MTU subnets may advertise and accept, respectively, | ||||
| an MTU option with an MTU larger than 1500. Hosts should use the | ||||
| minimum of the maximum feasible MTU and the MTU in the RA MTU option | ||||
| for the transmission of multicast packets. | ||||
| Note that advertising an MTU option larger than 1500 can only work on | ||||
| subnets where all the hosts implement variable MTU subnets. | ||||
| 4.3 The switch MTU advertisement message | ||||
| Switches and other layer 2 devices MAY advertise the maximum MTU they | ||||
| support in an ICMPv6 [RFC2463] message sent to multicast address TBD. | ||||
| The format of this ICMPv6 message is as follows: | ||||
| 1 2 3 | ||||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Type | Code | Checksum | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Number of MTUs | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | | | ||||
| + Switch identifier + | ||||
| | | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Reserved | Link speed 1 | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Advised MTU 1 | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Reserved | Link speed 2 | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Advised MTU 2 | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| ... | ||||
| | Reserved | Link speed N | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Advised MTU N | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| Type: TBD (informational) | ||||
| Code: TBD | ||||
| Checksum: see [RFC2463] | ||||
| Number of MTUs: | ||||
| Number of times the reserved/link speed/advised MTU fields are | ||||
| repeated for different link speed values. The minimum is 1, the | ||||
| maximum 4. | ||||
| Switch identifier: a 64-bit value that is unique to the switch. | ||||
| Reserved: 0 on transmission, ignored on reception. | ||||
| Link speed: | When a node's interface speed changes, it MAY reinitiate | |||
| Minimum link speed the option may apply to. Values from 0 to 49151 | negotiation of per-neighbor MTUs, but it SHOULD remain prepared to | |||
| indicate a link speed in megabits per second. Values from 49152 to | receive packets of the maximum size indicated to neighbors | |||
| 65535 are reserved for future use, but imply a link speed of more | previously. | |||
| than 49151 Mbps. Hosts MUST ignore all options with a link speed | ||||
| value that's lower than the current link speed of the interface | ||||
| the option is received over. Note that this is the opposite | ||||
| behavior of that specified for the link speed in the RA variable | ||||
| MTU option. | ||||
| Advised MTU: | Devices not acting as IPv6 routers that need to inform hosts on the | |||
| The IPv6 MTU the switch supports on ports operating at the | local subnet of MTU limitations MAY send out a router advertisement | |||
| indicated link speed. In the case of ethernet, the IPv6 MTU is the | with a Router Lifetime of 0 [RFC2461] and the pertinent information | |||
| maximum frame size after subtracting the size of the VLAN tag, the | in a multi-MTU option. | |||
| 14-byte Ethernet II header and the frame check sequence. | ||||
| Switch MTU advertisements should be sent out at 5-minute intervals. | 4.2 Changes to the RA MTU option semantics | |||
| When a port transitions from an inactive or disconnected to an active | ||||
| state, the interval MAY be reduced to 60 seconds, such that if it has | ||||
| been 60 seconds or longer ago that the last switch MTU advertisement | ||||
| was sent out, a switch MTU advertisement is sent out immediately. | ||||
| If the switch doesn't otherwise implement IPv6, or the IPv6 protocol | Hosts are currently supposed to ignore an MTU of more than 1500 in | |||
| is inactive, the IPv6 source address should be the unspecified | the MTU option in router advertisements on ethernet links | |||
| address. Since all the information in the message is thus known in | [RFC2464]. This makes it impossible to use an MTU larger than 1500 | |||
| advance, the entire message, including the checksum, may be | octets for multicast packets. In order to lift this limitation, | |||
| pre-calculated without the need to implement IPv6 in the switch. | routers and hosts that implement multi-MTU subnets may advertise | |||
| and accept, respectively, an MTU option with an MTU larger than | ||||
| 1500. Hosts should use the minimum of the MAXMTU for their link | ||||
| speed and the MTU in the RA MTU option for the transmission of | ||||
| multicast packets. | ||||
| Hosts SHOULD monitor switch MTU advertisement messages, using the | Note that advertising an MTU option larger than 1500 can only work on | |||
| switch identifier field to detect refreshes/duplicates, and retain all | subnets where all the hosts implement multi-MTU subnets. | |||
| switch MTU advertisements for 10 minutes. When the switch MTU | ||||
| advertisement information changes (new advertisements, new information | ||||
| in previously known advertisements, advertisements expire), hosts | ||||
| SHOULD select the minimum advised MTU value where the associated link | ||||
| speed is equal to or higher than the current link speed on the | ||||
| associated interface. The thusly recovered advised MTU for the link is | ||||
| the minimum of the MTUs supported by all the switches for this | ||||
| particular link speed if all switches implement the switch MTU | ||||
| advertisement mechanism. | ||||
| 4.4 The neighbor discovery MTU option | 4.3 The IPv6 neighbor discovery MTU and padding options | |||
| A node that implements the variable MTU subnet capability SHOULD | A node that implements the multi-MTU subnet capability SHOULD | |||
| include an MTU option in both neighbor solicitation and neighbor | include an MTU option in both neighbor solicitation and neighbor | |||
| advertisement messages [RFC2461]. A node MAY omit the option if the | advertisement messages [RFC2461]. A node MAY omit the option if the | |||
| use of a larger MTU isn't desired at that time or if the MTU it would | use of a larger MTU isn't desired at that time or if the MTU it would | |||
| advertise is equal to or lower than the MTU that would otherwise be | advertise is equal to or lower than the MTU that would otherwise be | |||
| used. However, there is no requirement to omit the option depending on | used. However, there is no requirement to omit the option depending on | |||
| the value of the different MTU variables as the receiver must | the value of the different MTU variables as the receiver must | |||
| implement the logic required to determine which MTU to use anyway. | implement the logic required to determine which MTU to use anyway. | |||
| The format of the neighbor discovery MTU option is as follows: | The format of the neighbor discovery MTU option is as follows: | |||
| skipping to change at page 12, line 29 ¶ | skipping to change at page 12, line 4 ¶ | |||
| The format of the neighbor discovery MTU option is as follows: | The format of the neighbor discovery MTU option is as follows: | |||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Type | Length | Reserved | | | Type | Length | Reserved | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | MTU | | | MTU | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| Type: TBD | Type: TBD | |||
| Length: 1 | Length: 1 | |||
| Reserved: set to 0 on transmission, ignored on reception. | Reserved: set to 0 on transmission, ignored on reception. | |||
| MTU: | MTU: | |||
| The maximum packet size the node is prepared to send and receive, | The maximum packet size in octets that the node is prepared to | |||
| which is copied from the local MTU. The minimum valid value is | receive. The minimum valid value is 1280. | |||
| 1280. | ||||
| Reception of a neighbor solicitation or a neighbor advertisement | ||||
| triggers the sending of an ICMPv6 MTU detection message. | ||||
| The MTU detection message | ||||
| Since it's possible that there are layer 2 devices that don't | The format of the neighbor discovery MTU option is as follows: | |||
| implement the switch MTU advertisement message in the path between two | ||||
| nodes, it's necessary to make sure that it is indeed possible to send and | ||||
| receive packets larger than the standard MTU. This is what the ICMPv6 | ||||
| MTU detection message is for. It has the following format: | ||||
| 1 2 3 | ||||
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Type | Code | Checksum | | | Type | Length |R| Reserved | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| |R| Reserved | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| | Packet size | | ||||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| | Padding | | | Padding | | |||
| ... | ~ ~ | |||
| | | | | | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| Type: TBD (informational) | Type: TBD | |||
| Code: TBD | Length: see below. | |||
| Checksum: see [RFC2463] | R: reply flag. | |||
| R (reply requested): 0: no reply requested, 1: reply requested | Reserved: set to 0 on transmission, ignored on reception. | |||
| Reserved: 0 on transmission, ignored on reception | Padding: 0 or more all-zero octets. | |||
| Packet size: | The MTU option is included in all neighbor advertisement and | |||
| Size of this packet, including IPv6 and other headers. A value of | neighbor solicitation messages. | |||
| 0 indicates no padding is present and the size of the packet | ||||
| shouldn't be considered. | ||||
| Padding: | Reception of a neighbor solicitation or a neighbor advertisement | |||
| 0 or more 0 bytes to bring the packet to the specified packet | from a neighbor for which no per-neighbor MTU is known | |||
| size. | triggers, in addition to the normal response if it's a neighbor | |||
| solicitation, the sending of a neighbor solicitation message with | ||||
| the MTU and padding options in it. The size of this message may | ||||
| vary between the IPv6-over-... size + 1 for the link and the | ||||
| minimum of the relevant MAXMTU, the physical MTU and the neighbor's | ||||
| MTU as advertised in the MTU option of the packet received. See | ||||
| below for considerations about the packet sizes to choose. The | ||||
| padding option is used to bring the neighbor solicitation message | ||||
| to this size. The padding option MUST be the last option in the | ||||
| packet. | ||||
| In order to avoid sending large numbers of packets that can't be | There are two possible ways to determine the value of the length | |||
| handled properly by switches or other layer 2 devices, after sending a | field: | |||
| large MTU detection packet, no other maximum size MTU detection | ||||
| packets may be transmitted on the same interface for 60 seconds or | ||||
| until a large MTU detection packet has been received, whichever | ||||
| happens first. In this context, "large" means larger than the standard | ||||
| MTU size for the link type, i.e., 1500 bytes for ethernet. | ||||
| When variable MTU subnet capability is detected for a neighbor by the | 1. Set it to 0. As the "length" field in options has a granularity | |||
| presence of an MTU option in a neighbor solicitation or neighbor | of 8 octets and the behavior of nodes when they receive a | |||
| discovery message, an MTU detection message is constructed as follows: | neighbor solicitation packet which has a total length that | |||
| doesn't match the length of the packet contents, an option | ||||
| length of 0 is used to make sure that hosts that don't | ||||
| understand the padding option will silently discard the packet. | ||||
| R: | 2. If the intended packet length allows a valid value for the | |||
| Set to 0 if the neighbor MTU is known and confirmed, set to 1 | length field, the length field MAY be set to that value. The | |||
| otherwise. | node MAY reduce the size of the intended packet to accommodate | |||
| the requirement that the size field is a multiple of 8 octets. | ||||
| E.g., if the intended packet size is 4470 octets with 40 and 24 | ||||
| octets for the IPv6 and neighbor solicitation headers, | ||||
| respectively, the padding option would have to be 4406 octets | ||||
| long, which can't be expressed in the length field. The node may | ||||
| choose to use a packet size of 4464 instead, which results in a | ||||
| length field value of 550. | ||||
| Packet size: | A neighbor solicitation message with the padding option is always | |||
| Equal to the minimum of the local MTU and the (tentative) neighbor | sent in addition to a regular neighbor solicitation message, rather | |||
| MTU. | than in place of one. | |||
| When an MTU detection packet is received, the size of the packet is | When a node receives a neighbor solicitation message with the | |||
| checked against the value in the packet size field to detect | padding option, it stops evaluating options when it reaches the | |||
| truncation in transit. If the packet size and the packet size field | padding option and returns a regular neighbor advertisement | |||
| don't match, or if the packet size is smaller than 1280 bytes, the | message, which includes the MTU option with the R flag set to 1. | |||
| message is silently discarded. | Whenever the neighbor advertisement is not the result of receiving | |||
| a neighbor solicitation with a padding option, the R flag is set to | ||||
| 0. | ||||
| If the received message has the R flag set to 1, a reply is | When a node receives a neighbor advertisement message, it must | |||
| constructed as follows: | determine whether the message is in reaction to a locally sent | |||
| neighbor solicitation with the padding option or not. If the MTU | ||||
| option is included in the message received, an R flag of 1 | ||||
| indicates that it is indeed a reply. In the absence of the MTU | ||||
| option the node must use heuristics relating to the timing of the | ||||
| messages it sent with and without the option, and the reception of | ||||
| the current message. If the message was a reply, the node sets the | ||||
| neighbor MTU to the size of the neighbor solicitation message that | ||||
| was replied to. | ||||
| R: 0 | If no reply is received after some time, either the neighbor is | |||
| incapable of receiving packets of the size that was used, or a | ||||
| device operating at the link layer was incapable of forwarding the | ||||
| frame. (Incidental packet loss is also a possibility.) In order to | ||||
| determine a workable MTU even in the presence of unknown | ||||
| limitations, a node may repeat sending a solicitation with the | ||||
| padding option. However, since presumably, some equipment may react | ||||
| badly to a large number of out-of-spec packets, it's important that | ||||
| nodes adjust their behavior in the presence of the C (conservative) | ||||
| flag in router advertisements. | ||||
| Packet size: | The above allows for two strategies in determining a neighbor's | |||
| Equal to the minimum of the local MTU and the neighbor MTU. | MTU: the node can depend on the presence of the mechanisms | |||
| described in this document, including setting the padding option | ||||
| length field to 0, or it can try to interoperate with nodes that do | ||||
| have the capability of using larger packet sizes, but don't | ||||
| implement any of the mechanisms described. In that case, the | ||||
| padding option must conform to [RFC2461] and care must be taken to | ||||
| avoid overly aggressive probing of nodes that do not support larger | ||||
| packets. | ||||
| The neighbor MTU overrules information in the TCP MSS option in TCP | Nodes MUST support reception of both types of probes, but MAY be | |||
| sessions towards that neighbor. Neighbor MTU information expires along | limited to generating only one type. | |||
| with link addresses learned through neighbor discovery and upon dead | ||||
| neighbor detection. | ||||
| 4.5 Determining the local MTU | 4.4 IPv4 ethernet jumbo ARP message | |||
| The local MTU is the value communicated to neighbors. It is the | Due to lack of neighbor discovery, with IPv4, it's necessary to use | |||
| minimum of the physical MTU for an interface and the allowed MTU as | ARP to probe for non-standard MTU capabilities. This is done by | |||
| advertised by a router or learned through other means. The local MTU | simply probing with an ARP packet padded to the desired size. If a | |||
| may be further reduced by the reception of switch MTU advertisements. | reply comes back, the neighbor supports the probed MTU size. | |||
| 4.5 Probe considerations | ||||
| In cases where the neighbor's MTU was advertised in an MTU option, | ||||
| it makes sense to try with this size. If that probe fails or the | ||||
| neighbor's MTU is unknown, the best choice for a probe size would | ||||
| be the smallest possible non-standard MTU. This could be the | ||||
| IPv6-over-... RFC's MTU size + 1, or a slightly larger value that | ||||
| represents the first larger size that is actually useful, such as | ||||
| 1508 or 1520 for ethernet. Failure at this size wastes relatively | ||||
| little bandwidth and indicates that further probes are unnecessary. | ||||
| If this probe is successful, further choices for the probe size may | ||||
| be common MTU sizes such as 1508, 1530, 1536, 1546, 1998, 2000, | ||||
| 2018, 4464, 4470, 8092, 8192, 9000, 9176, 9180, 9216, 17976, 64000 | ||||
| and 65280 octets. | ||||
| There is no requirement that a node tries a number of probes of | ||||
| different sizes; only that before oversized packets are sent, a | ||||
| reply for a probe of that size or larger MUST have been received | ||||
| from the neighbor in question, unless the N flag is set to 1. A | ||||
| simple strategy that would be appropriate when the C flag is set to | ||||
| 1, but may also be used otherwise, would be to initially send just | ||||
| one probe sized at the local MTU value, and if unsuccessful, only | ||||
| send a second probe when a probe from the neighbor is received. The | ||||
| second probe is made the same size as the neighbor's probe. | ||||
| Probes MUST be sent as unicast. | ||||
| 4.6 Neighbor MTU garbage collection | ||||
| The MTU size for a neighbor is garbage collected along with a | ||||
| neighbor's link address in accordance with regular ARP and neighbor | ||||
| discovery timeouts. Additionally, a neighbor's MTU size is reset to | ||||
| unknown after dead neighbor detection declares a neighbor "dead". | ||||
| 5 References | 5 References | |||
| 5.1 Normative References | 5.1 Normative References | |||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
| [RFC2461] Narten, T., Nordmark, E., and W. Simpson, "Neighbor | [RFC2461] Narten, T., Nordmark, E., and W. Simpson, "Neighbor | |||
| Discovery for IP Version 6 (IPv6)", RFC 2461, | Discovery for IP Version 6 (IPv6)", RFC 2461, | |||
| skipping to change at page 15, line 7 ¶ | skipping to change at page 15, line 34 ¶ | |||
| Autoconfiguration", RFC 2462, December 1998. | Autoconfiguration", RFC 2462, December 1998. | |||
| 5.2 Informative References | 5.2 Informative References | |||
| [CRC] Jain, R., "Error Characteristics of Fiber Distributed | [CRC] Jain, R., "Error Characteristics of Fiber Distributed | |||
| Data Interface (FDDI)", IEEE Transactions on | Data Interface (FDDI)", IEEE Transactions on | |||
| Communications, August 1990. | Communications, August 1990. | |||
| 6 Document and Author Information | 6 Document and Author Information | |||
| This document expires December, 2007. The latest version will always | This document expires February, 2008. The latest version will always | |||
| be available at http://www.muada.com/drafts/. Please direct questions | be available at http://www.muada.com/drafts/. Please direct questions | |||
| and comments to the ipv6 or int area mailing lists or directly to the | and comments to the ipv6 or int area mailing lists or directly to the | |||
| author: | author: | |||
| Iljitsch van Beijnum | Iljitsch van Beijnum | |||
| Email: iljitsch@muada.com | Email: iljitsch@muada.com | |||
| Full Copyright Statement | Full Copyright Statement | |||
| End of changes. 84 change blocks. | ||||
| 288 lines changed or deleted | 309 lines changed or added | |||
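
Illustrative sketch (not part of either draft revision): the following
Python fragment shows one way a host might decode the -01 multi-MTU
router advertisement option of section 4.1 and pick the MAXMTU for its
current link speed. It assumes the bit layout in the -01 diagram (C and
N as the two most significant bits of the second 16-bit word, Pri as the
three least significant bits) and the speed ranges given for MAXMTU1000,
MAXMTU100 and MAXMTU10. The names MultiMtuOption, parse_multi_mtu_option,
relevant_maxmtu and select_maxmtu are hypothetical, and the option Type
is not checked because its value is still TBD.

```python
import struct
from typing import Iterable, NamedTuple, Optional


class MultiMtuOption(NamedTuple):
    """Decoded fields of the -01 multi-MTU option (hypothetical container)."""
    conservative: bool  # C flag
    no_probe: bool      # N flag
    priority: int       # Pri, 0..7
    maxmtu1000: int
    maxmtu100: int
    maxmtu10: int


def parse_multi_mtu_option(data: bytes) -> MultiMtuOption:
    """Parse one option starting at its Type octet.

    Length is in units of 8 octets. Length 1 carries only MAXMTU1000,
    which then applies to all three speed ranges. Fields beyond the
    first 16 octets (Length > 2) are ignored rather than rejected.
    """
    if len(data) < 8:
        raise ValueError("option shorter than 8 octets")
    _opt_type, length, flags = struct.unpack_from("!BBH", data, 0)
    if length < 1 or len(data) < length * 8:
        raise ValueError("bad Length field")
    (maxmtu1000,) = struct.unpack_from("!I", data, 4)
    if length == 1:
        maxmtu100 = maxmtu10 = maxmtu1000
    else:
        maxmtu100, maxmtu10 = struct.unpack_from("!II", data, 8)
    return MultiMtuOption(
        conservative=bool(flags & 0x8000),
        no_probe=bool(flags & 0x4000),
        priority=flags & 0x0007,
        maxmtu1000=maxmtu1000,
        maxmtu100=maxmtu100,
        maxmtu10=maxmtu10,
    )


def relevant_maxmtu(opt: MultiMtuOption, speed_mbps: Optional[int]) -> int:
    """Return the MAXMTU field for the interface speed in Mbps.

    Assumption: None means "unknown speed that could be 30 Mbps or
    higher", which the draft maps to MAXMTU100.
    """
    if speed_mbps is None:
        return opt.maxmtu100
    if speed_mbps >= 300:
        return opt.maxmtu1000
    if speed_mbps >= 30:
        return opt.maxmtu100
    return opt.maxmtu10


def select_maxmtu(options: Iterable[MultiMtuOption],
                  speed_mbps: Optional[int],
                  physical_mtu: int) -> int:
    """Walk options from highest to lowest priority, skipping fields
    that are unspecified (zero); fall back to the physical MTU."""
    for opt in sorted(options, key=lambda o: o.priority, reverse=True):
        value = relevant_maxmtu(opt, speed_mbps)
        if value != 0:
            return value
    return physical_mtu
```

With a subnet-level option (Pri 100) that leaves MAXMTU100 at zero and a
site-level option (Pri 010) that sets all three fields, a host running at
100 Mbps would fall through to the site-level value, while a host running
at 1000 Mbps would use the subnet-level MAXMTU1000.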