INTERNET-DRAFT                                           Tathagata Nandy
Intended Status: Proposed Standard                                   HPE
                                                            Nitin Singla
                                                                     HPE
                                                      Utkarsh Srivastava
                                                                     HPE
Expires: 19 October 2020                                  April 19, 2020

                           Multicast Path MTU
            draft-nandy-singla-utkarsh-pim-mcast-path-mtu-00

Abstract
   Path MTU discovery (RFC 1191) is a standard technique to determine
   the supported MTU between two Internet Protocol (IP) hosts to avoid
   any fragmentation.  In a multicast distribution tree, the source
   does not know where the receivers are located, so the technique
   used to compute the path MTU for a unicast stream does not work in
   a multicast network.  This document describes a method to discover
   the multicast path MTU with the goal of avoiding traffic loss.  The
   solution also aims to solve the problem of traffic loss for
   multicast streams caused by incorrect MTU settings and the lack of
   path MTU support in multicast networks.

Status of This Memo
   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on 19 October 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this
   document.  Please review these documents carefully, as they
   describe your rights and restrictions with respect to this
   document.  Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Conventions used in this document
   3. Problem Statement
   4. Multicast Data Path
      4.1. First Hop Source Router and Rendezvous Point
           Pre-Registration
      4.2. Multicast Flow and PMTU
      4.3. Last Hop Router to the Host MTU
   5. IANA Considerations
   6. Security Considerations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. Acknowledgments
   9. Authors' Addresses

1. Introduction
   When one IP host has a large amount of data to send to another
   host, the data is transmitted as a series of IP datagrams.  It is
   usually preferable that these datagrams be of the largest size that
   does not require fragmentation anywhere along the path from the
   source to the destination.  (For the case against fragmentation,
   see [5].)
   This datagram size is referred to as the Path MTU (PMTU), and it is
   equal to the minimum of the MTUs of each hop in the path.

   A shortcoming of the current Internet protocol suite is the lack of
   a standard mechanism for a host to discover the PMTU of an
   arbitrary path.  Note: The Path MTU is what RFC 1122 calls the
   "Effective MTU for sending" (EMTU_S).  A PMTU is associated with a
   path, which is a particular combination of IP source and
   destination address and perhaps a Type-of-service (TOS).

   The current practice (RFC 1122) is to use the lesser of 576 and the
   first-hop MTU as the PMTU for any destination that is not connected
   to the same network or subnet as the source.

   In computer networking, multicast is group communication where data
   transmission is addressed to a group of destination computers
   simultaneously.  Multicast can be one-to-many or many-to-many
   distribution.  Multicast should not be confused with physical layer
   point-to-multipoint communication.  Ethernet frames with a value of
   1 in the least-significant bit of the first octet of the
   destination address are treated as multicast frames and are flooded
   to all points on the network.  This mechanism constitutes multicast
   at the data link layer, and it is used by IP multicast to achieve
   one-to-many transmission for IP on Ethernet networks.  Modern
   Ethernet controllers filter received packets to reduce CPU load by
   looking up the hash of a multicast destination address in a table,
   initialized by software, which controls whether a multicast packet
   is dropped or fully received.

   IP multicast is a technique for one-to-many communication over an
   IP network.  The destination nodes send Internet Group Management
   Protocol (IGMP) join and leave messages, for example in the case of
   IPTV when the user changes from one TV channel to another.
   Multicast uses network infrastructure efficiently by requiring the
   source to send a packet only once, even if it needs to be delivered
   to a large number of receivers.  The nodes in the network take care
   of replicating the packet to reach multiple receivers only when
   necessary.

2. Conventions used in this document
2.1. Terminology
   The reader is assumed to be familiar with the terminology,
   reference models, and taxonomy defined in [RFC4664] and [RFC4665].
   For readability purposes, we repeat some of the terms here.
   Moreover, we also propose some other terms needed when IP multicast
   support is discussed.

   Multicast domain
      An area in which multicast data is transmitted.  In this
      document, this term has a generic meaning that can refer to
      Layer-2 and Layer-3.  Generally, the Layer-3 multicast domain is
      determined by the Layer-3 multicast protocol used to establish
      reachability between all potential receivers in the
      corresponding domain.  The Layer-2 multicast domain can be the
      same as the Layer-2 broadcast domain (i.e., VLAN), but it may be
      restricted to being smaller than the Layer-2 broadcast domain if
      an additional control protocol is used.

   PIM-SM
      Protocol Independent Multicast Sparse Mode (PIM-SM) is a
      multicast routing protocol for Internet Protocol (IP) networks
      that provides one-to-many and many-to-many distribution of data
      over a LAN, WAN or the Internet.  It explicitly builds
      unidirectional shared trees rooted at a rendezvous point (RP)
      per group, and optionally creates shortest-path trees per
      source.  PIM-SM uses shared trees by default and implements
      source-based trees for efficiency; it assumes that no hosts want
      the multicast traffic unless they specifically ask for it.
      Senders first send the multicast data to the RP, which in turn
      sends the data down the shared tree to the receivers.

   RP
      Rendezvous Point (RP) is a router in a multicast network domain
      that acts as a shared root for a multicast shared tree.  Any
      number of routers can be configured to work as RPs and they can
      be configured to cover different group ranges.  An RP acts as the
      meeting place for sources and receivers of multicast data.  In a
      PIM-SM network, sources must send their traffic to the RP.  This
      traffic is then forwarded to receivers down a shared
      distribution tree.

2.2. Conventions
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

3. Problem Statement
3.1. Motivation
   Path MTU discovery computes the lowest MTU supported between two
   hosts to avoid IP fragmentation.  For a unicast packet, the source
   device sends out a packet with the Don't Fragment (DF) flag bit set
   in the IP header [1].  Any device along the path whose MTU is
   smaller than the packet will drop the packet and send back an ICMP
   Destination Unreachable message with the code "fragmentation needed
   and DF set" (Type 3, Code 4; the ICMPv6 equivalent is Packet Too
   Big, Type 2) containing its MTU, allowing the source host to reduce
   its Path MTU appropriately.  The process is repeated until the MTU
   is small enough to traverse the entire path without fragmentation.

   In a multicast distribution tree, the source does not know the
   hosts of a multicast group until the complete multicast tree is
   built.  Hosts in different branches of the tree use IGMP/MLD
   followed by PIM to become part of the multicast tree.  Generally,
   the process starts at the host, which sends a request to become
   part of a multicast tree through IGMP joins.  The same request is
   sent to the RP, and thereby the source and the group develop a
   common path.  So the technique mentioned above may not work for
   multicast flows.

3.2. Scalability
   Most routers do not send ICMP (unreachable; fragmentation needed)
   messages in response to too-big IPv4 multicast packets with the DF
   bit set.  They silently drop these packets, breaking PMTUD.  This
   behaviour is by design: Section 7.2 of RFC 1112 specifies that an
   ICMP error message (Destination Unreachable, Time Exceeded,
   Parameter Problem, Source Quench, or Redirect) is never generated
   in response to a datagram destined to an IP host group.  The same
   document also describes why RFC 1112 prohibits sending ICMP error
   messages in response to multicast datagrams.  The processing done
   on ICMP error replies by the *nix socket API might block the sender
   socket if an error comes back from a single receiver or if the TTL
   expires when traversing a particularly long branch of the multicast
   tree, which is not a good idea in a multicast environment.

4. Multicast Data Path
   The multicast stream between a source and a host for a particular
   group uses the following path.

   1. The source router sends PIM Register packets to the Rendezvous
      Point (RP) router with the source data encapsulated in them.
      These are unicast packets.

   2. The host router sends PIM Joins to the RP, and from there the
      source and core-based trees are built.

4.1. First Hop Source Router and Rendezvous Point Pre-Registration
   For the network segment between the first hop router and the PIM
   Rendezvous Point (RP), multicast data packets are encapsulated into
   PIM Register messages.  PIM Register messages are unicast messages,
   so the standard Path MTU discovery technique works for this
   segment.

4.2. Multicast Flow and PMTU
   For other segments in the network, data is sent as multicast
   packets and the following sequence is used to determine the path
   MTU for different branches in the multicast tree:

   1. A new multicast flow received on any router will not have any
      match in the multicast routing table and hence is treated as an
      unknown multicast flow.  Such streams are copied to the CPU to
      program the flows in hardware.

   2. When the packet is processed by the multicast process to program
      an unknown flow, the process computes the outgoing interface
      list (Olist) for the flow based on IGMP/MLD joins or PIM joins
      from downstream routers.

   3. The proposal is that, for each interface in the Olist, an
      additional check is performed in which the MTU supported on the
      interface is compared with the size of the multicast data
      packet.  If the packet size is greater than the supported MTU,
      an ICMP Fragmentation Needed (Type 3, Code 4) message containing
      that MTU is sent back towards the source, allowing the source DR
      to re-compute the MTU appropriately.  This is done irrespective
      of whether the DF bit is set.

   4. An error message will be logged in each of the routers
      performing this check.  Optionally, an SNMP trap can also be
      sent.  This would lead the admin either to change the MTU of the
      interfaces so that the multicast data can go through, or to have
      the source DR fragment the data before sending it.

   5. Optionally, some routers can program the Mroute entry with an
      error indicating that packets might be dropped because of their
      large size.  This is implementation specific.

   6. Optionally, in all the routers where this check is performed,
      the unknown multicast data packet can be programmed as a bridge
      entry in hardware such that no further packets reach the CPU.

   7. This computation is done at the connection establishment phase
      itself for the PIM-SM network, such that the Mroute entry is
      never programmed in hardware without the MTU computation.
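
   The per-interface check in step 3 can be illustrated with a
   minimal sketch.  It is not part of the proposal itself: the names
   Interface, check_olist, build_frag_needed and inet_checksum are
   hypothetical, the Olist is modelled as a plain list, and the
   message layout (16 unused bits followed by a 16-bit Next-Hop MTU)
   is assumed to follow RFC 1191.  Sending the message, logging and
   hardware programming are left out.

```python
import struct
from dataclasses import dataclass

ICMP_DEST_UNREACH = 3  # ICMPv4 Destination Unreachable
ICMP_FRAG_NEEDED = 4   # code: fragmentation needed and DF set


@dataclass
class Interface:
    """Hypothetical Olist entry: interface name and its supported MTU."""
    name: str
    mtu: int


def inet_checksum(data: bytes) -> int:
    """Internet checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_frag_needed(next_hop_mtu: int, offending: bytes) -> bytes:
    """Pack a Type 3, Code 4 message quoting the offending datagram.

    Assumed layout: type, code, checksum, 16 unused bits, 16-bit
    Next-Hop MTU, then the IP header plus the first 8 data octets.
    """
    ihl = (offending[0] & 0x0F) * 4          # IPv4 header length in octets
    quoted = offending[: ihl + 8]
    fields = (ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED)
    draft = struct.pack("!BBHHH", *fields, 0, 0, next_hop_mtu) + quoted
    csum = inet_checksum(draft)              # computed with checksum field 0
    return struct.pack("!BBHHH", *fields, csum, 0, next_hop_mtu) + quoted


def check_olist(packet: bytes, olist: list[Interface]) -> list[tuple[str, bytes]]:
    """Compare the packet size with every Olist MTU, regardless of DF."""
    reports = []
    for ifc in olist:
        if len(packet) > ifc.mtu:
            reports.append((ifc.name, build_frag_needed(ifc.mtu, packet)))
    return reports


# Usage: a 1400-octet datagram (IHL = 5) checked against two interfaces.
packet = bytes([0x45]) + bytes(1399)
olist = [Interface("vlan10", 1500), Interface("tunnel1", 1300)]
for name, message in check_olist(packet, olist):
    print(name, len(message))
```

   Returning one report per violating interface, rather than stopping
   at the first, mirrors the wording of step 3, where the check is
   made for each interface in the Olist.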

4.3. Last Hop Router to the Host MTU
   The host sends IGMP Joins to join a particular group.  When unknown
   multicast traffic is received at the router, the router computes
   the MTU for the joined paths and sends an ICMP error packet back to
   the source if there is a violation.

   1. The source host learns the lowest MTU supported among all the
      branches of the multicast tree and updates the size of the
      datagrams accordingly.

   2. This path is the same as in the previous section; the only
      difference is that the Joins are IGMP Joins rather than PIM
      Joins.

5. IANA Considerations
   This memo includes no request to IANA.

6. Security Considerations
   This Path MTU Discovery mechanism makes possible two
   denial-of-service attacks, both based on a malicious party sending
   false Datagram Too Big messages to an Internet host.

   In the first attack, the false message indicates a PMTU much
   smaller than reality.  This should not entirely stop data flow,
   since the victim host should never set its PMTU estimate below the
   absolute minimum, but at 8 octets of IP data per datagram, progress
   could be slow.

   In the other attack, the false message indicates a PMTU greater
   than reality.  If believed, this could cause temporary blockage as
   the victim sends datagrams that will be dropped by some router.
   Within one round-trip time, the host would discover its mistake
   (receiving Datagram Too Big messages from that router), but
   frequent repetition of this attack could cause lots of datagrams to
   be dropped.  A host, however, should never raise its estimate of the
   PMTU based on a Datagram Too Big message, so should not be
   vulnerable to this attack.

   A malicious party could also cause problems if it could stop a
   victim from receiving legitimate Datagram Too Big messages, but in
   this case there are simpler denial-of-service attacks available.
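
   The host-side rules described above (never go below the absolute
   minimum, never raise the estimate because of a Datagram Too Big
   message), combined with the lowest-branch rule of Section 4.3, can
   be sketched as follows.  This is only an illustration: the
   function names are hypothetical, and the 68-octet floor is the
   IPv4 minimum taken from RFC 1191, which is an assumption about
   which "absolute minimum" is meant here.

```python
MIN_IPV4_PMTU = 68  # assumed floor, per RFC 1191 (60-octet header + 8 data octets)


def update_pmtu(current: int, reported: int) -> int:
    """Return the new PMTU estimate after one Datagram Too Big report."""
    if reported >= current:
        return current  # a report never raises the estimate
    # Clamp to the floor, but never end up above the current estimate.
    return min(current, max(reported, MIN_IPV4_PMTU))


def multicast_pmtu(first_hop_mtu: int, branch_reports: list[int]) -> int:
    """Fold the reports from all branches into one estimate.

    Starting from the first-hop MTU, the result is the lowest MTU
    reported by any branch of the tree, subject to the floor.
    """
    pmtu = first_hop_mtu
    for reported in branch_reports:
        pmtu = update_pmtu(pmtu, reported)
    return pmtu


# Usage: reports from four branches; the 9000 report is ignored.
print(multicast_pmtu(1500, [1500, 1300, 1400, 9000]))
```

   Because every report can only lower the estimate, the order in
   which branches report does not change the final value.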

   In another case, if packets are always rejected because they exceed
   the MTU and neither the sender changes the packet size nor the
   admin adjusts the MTU, there is a risk of a DoS attack on the
   switch sending the ICMP error packet.  Multicast packets sent at a
   high rate can consume the CPU resources of all the routers
   implementing the PMTU for multicast.

7. References
7.1. Normative References
   [1] J. Mogul, S. Deering.  Path MTU Discovery.  RFC 1191, DECWRL
       and Stanford University, November 1990.
   [2] J. Postel.  Internet Control Message Protocol.  RFC 792,
       ISI, September 1981.
7.2. Informative References
   [3]

   [4]
   [5]

8. Acknowledgments
   The authors thank the contributors of [RFC1191] and [RFC5501] since
   the structure and content of this document were, for some sections,
   largely inspired by them.  The authors also thank Mark Pearson and
   others for their valuable reviews and feedback.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE
   COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
   INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
   BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
   LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
   CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
   LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
   ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
   POSSIBILITY OF SUCH DAMAGE.

9. Authors' Addresses
   Tathagata Nandy
   Hewlett Packard India Software Operations Pvt. Ltd.
   Survey # 192, Whitefield Road,
   Mahadevapura Post, Bangalore 560048. India
   Phone: (+91) 9611895857
   EMail: tathagata.nandy@hpe.com

   Nitin Singla
   Hewlett Packard India Software Operations Pvt. Ltd.
   Survey # 192, Whitefield Road,
   Mahadevapura Post, Bangalore 560048. India
   Phone: (+91) 7411937209
   EMail: singla@hpe.com

   Utkarsh Srivastava
   Hewlett Packard India Software Operations Pvt. Ltd.
   Survey # 192, Whitefield Road,
   Mahadevapura Post, Bangalore 560048. India
   Phone: (+91) 7411937209
   EMail: usrivastava@hpe.com