ARMD BOF                                                       L. Dunbar
Internet Draft                                                  S. Hares
Intended status: Informational                                    Huawei
Expires: September 2011                                     M. Sridharan
                                                       N. Venkataramaiah
                                                               Microsoft
                                                           B. Schliesser
                                                           Cisco Systems
                                                          March 14, 2011

        Address Resolution for Large Data Center Problem Statement
                draft-dunbar-armd-problem-statement-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on September 14, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   Modern data center networks face a number of scale challenges.  One
   such challenge for so-called "massive" data center networks is
   address resolution, such as is provided by ARP and/or ND.  This
   document describes the problem of address resolution in massive
   data centers.  It discusses the network impact of various data
   center technologies, including server virtualization, illustrates
   why it is still desirable to have multiple hosts on the same Layer
   2 data center network, and describes the address resolution
   problems this type of Layer 2 network will face.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119.

Table of Contents

   1. Introduction
      1.1. Server Virtualization
      1.2. Physically Massive Facilities
      1.3. Geographically Extended Network Segments
   2. Terminology
   3. Layer 2 Requirements in the Data Center
      3.1. Layer 2 Requirement for VM Migration
      3.2. Layer 2 Requirement for Network Services
      3.3. Layer 2 Requirement for Active/Standby VMs
   4. Cloud and Internet Data Centers with Virtualized Servers
   5. ARP/ND Issues in the Data Center
   6. ARP and VM Migration
   7. Limitations of VLANs/Smaller Subnets in the Cloud Data Center
   8. Why the IETF Needs to Develop Solutions Instead of IEEE 802
   9. Conclusion and Recommendation
   10. Manageability Considerations
   11. Security Considerations
   12. IANA Considerations
   13. Acknowledgments
   14. Informative References
   Authors' Addresses
   Intellectual Property Statement
   Disclaimer of Validity

1. Introduction

   Modern data center networks face a number of scale challenges,
   especially as they reach sizes and densities that are "massive"
   relative to historical norms.  One such challenge is the effective
   and efficient performance of address resolution, such as is
   provided by ARP and/or ND.
   The fundamental issue challenging address resolution in massive
   data centers is the need to grow both the number and the density of
   logical Layer 2 segments while retaining flexibility in the
   physical location of host attachment.  This problem has
   historically been bounded by physical limits on data center size,
   as well as by practical considerations in the physical placement of
   server resources.  However, the increasing popularity of server
   virtualization technology (e.g., in support of "cloud" computing),
   the trend toward building physically massive data center
   facilities, and the logical extension of network segments across
   traditional geographic boundaries are driving an increase in the
   number of addresses in the modern data center network.

1.1. Server Virtualization

   Server virtualization allows the underlying physical machine
   (server) resources to be shared among multiple virtual machines,
   each running its own operating system.  Server virtualization is
   the key enabler of data center workload agility, i.e., allowing any
   server to host any application and providing the flexibility to
   add, shrink, or move services within the physical infrastructure.
   Server virtualization provides numerous benefits, including higher
   utilization, increased data security, reduced user downtime, and
   significant power savings, along with the promise of a more
   flexible and dynamic computing environment.  However, server
   virtualization also stresses the data center network by enabling
   the creation of many more network hosts (accompanied by their
   network interfaces and addresses) within the same physical
   footprint.

   Further, in order to maximize the benefits of server
   virtualization, VM placement algorithms (e.g., based on efficiency,
   capacity, redundancy, or security) may be designed in ways that
   increase both the range and the density of Layer 2 segments.  For
   instance, such an algorithm may satisfy the processing requirements
   of each VM with the minimal number of physical servers and
   switching devices, while simultaneously spreading the VMs across a
   diverse and redundant infrastructure.  The result can be a large
   number of distinct Layer 2 segments attached to each physical host,
   as well as a larger number and range of data-center-wide Layer 2
   segments.  With this and similar types of VM placement algorithms,
   subnets tend to extend throughout the network, and the ARP/ND
   traffic associated with each subnet is likely to traverse a
   significant number of links and switches.

1.2. Physically Massive Facilities

   Independent of server virtualization technology, the physical
   facilities of data centers have grown larger in recent years.
   There are inherent efficiencies in constructing larger data center
   buildings, infrastructure, and networks.  As data center operators
   pursue these physical efficiencies, the address resolution problem
   described in this document becomes more prevalent.  Physically
   massive data centers may face address resolution scale challenges
   simply because of their physical capacity.  Combined with server
   virtualization, the host and address density of these facilities is
   historically unmatched.

1.3. Geographically Extended Network Segments

   The modern data center network is shaped by demands for flexibility
   driven by cloud computing, demands for redundancy driven by
   regulatory or enterprise uptime requirements, and demands on
   topology driven by security and/or performance.  In support of
   these and other demands, VPN and physical network extensions
   (including both Layer 3 and Layer 2 extensions) expand the data
   center network beyond physical and/or geographical boundaries.
   As a result, the number of addresses present on a single Layer 2
   segment may be greater than the number of hosts physically or
   logically present within the data center itself.  Combined with
   physically massive data center facilities and server
   virtualization, this trend can yield numbers of addresses per Layer
   2 segment far beyond any historical norm, truly challenging address
   resolution protocols such as ARP and/or ND.

2. Terminology

   Aggregation Switch:  A Layer 2 switch interconnecting ToR switches.

   Bridge:  An IEEE 802.1Q-compliant device.  In this draft, "bridge"
      is used interchangeably with "Layer 2 switch".

   CUG:  Closed User Group

   DC:  Data Center

   DA:  Destination Address

   EoR:  End-of-Row switch in a data center

   FDB:  Filtering Database of a bridge or Layer 2 switch

   SA:  Source Address

   ToR:  Top-of-Rack switch, also known as an access switch

   VM:  Virtual Machine

   VPN:  Virtual Private Network

3. Layer 2 Requirements in the Data Center

3.1. Layer 2 Requirement for VM Migration

   VM migration refers to moving a virtual machine from one physical
   server to another.  Current technology even allows real-time
   migration of VMs in a "live" state.  Seamlessly moving VMs within a
   resource pool is key to achieving efficient server utilization and
   data center agility.

   One of the key requirements for VM migration is that the VM retain
   the same IP address and MAC address after moving to the new
   location, so that its operation can continue there.  This
   requirement is even more stringent for "live" migrations, in which
   ongoing stateful connections must be maintained.  Thus, in the
   absence of new technology, VMs can only be migrated among servers
   on the same Layer 2 network.

3.2. Layer 2 Requirement for Network Services

   Many network services, such as firewalls and load balancers, must
   be in-line with network traffic in order to function correctly.  As
   such, Layer 2 networks often provide a form of traffic engineering
   that steers the traffic of a given subnet or segment through these
   devices.

   Further, even in some cases where a network service need not be in-
   line for all traffic, it must be connected to a common Layer 2
   segment in order to function.  One common example is load balancing
   (providing a single Internet service from multiple servers) with
   Layer 2 Direct Server Return.  While a traditional load balancer
   sits in-line between the clients and the hosts that serve them, for
   applications with relatively little traffic toward the servers and
   relatively large amounts of traffic from the servers it is
   sometimes desirable to let reply data from the servers go directly
   to the clients without passing through the load balancer.  In this
   kind of design the load balancer and the cluster of hosts must be
   on the same Layer 2 network so that they can reach each other by
   their MAC addresses.

3.3. Layer 2 Requirement for Active/Standby VMs

   For redundant servers (or VMs) serving redundant instances of the
   same application, the Active and Standby servers (VMs) need to
   exchange keep-alive messages.  Further, the mechanism for failing
   over from Active to Standby may rely on taking over a shared MAC
   address and/or on some kind of ARP/ND announcement.  When the
   Active server fails or is taken out of service, the switchover to
   the Standby is transparent only if the two are on the same Layer 2
   network.
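   The ARP announcement mentioned above is commonly realized as a
   gratuitous ARP, in which a host announces its own IP-to-MAC binding
   to the Ethernet broadcast address.  The following is an informal
   sketch of one common form of such an announcement, assuming the
   Scapy packet library is available; the interface name and addresses
   are placeholders, not values taken from this document.

      # Minimal sketch of the gratuitous ARP announcement a Standby VM
      # (or a migrated VM) might send when it takes over an address.
      # Assumes the Scapy packet library; addresses and interface name
      # below are placeholders.  Sending requires raw-socket privileges.
      from scapy.all import ARP, Ether, sendp

      TAKEN_OVER_IP = "192.0.2.10"        # shared service IP (example)
      NEW_OWNER_MAC = "02:00:00:00:00:0a" # MAC now answering for it

      # In a gratuitous ARP the sender announces its own IP: sender IP
      # equals target IP, and the frame goes to the broadcast address.
      garp = (Ether(src=NEW_OWNER_MAC, dst="ff:ff:ff:ff:ff:ff") /
              ARP(op=2,                        # ARP reply ("is-at")
                  hwsrc=NEW_OWNER_MAC, psrc=TAKEN_OVER_IP,
                  hwdst="ff:ff:ff:ff:ff:ff", pdst=TAKEN_OVER_IP))

      sendp(garp, iface="eth0")

   Every such announcement is delivered to every host and switch port
   in the Layer 2 domain, which is precisely what makes these events
   expensive at the scales discussed in this document.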
4. Cloud and Internet Data Centers with Virtualized Servers

   Cloud computing services often allow subscribers to create their
   own virtual hosts and virtual subnets, which are housed within the
   cloud provider's data centers.  Network service providers may also
   extend existing VPNs to connect to VMs hosted by servers in the
   provider's data center(s).  This is often realized by grouping the
   hosts belonging to one subscriber's VPN into distinct, segregated
   subnets in the data center(s).  This design for a multi-tenant data
   center network typically requires the secure segregation of
   different customers' VMs and hosts.

   Further, these client subnets in the data center may use client-
   specific IP addresses, which can lead to overlapping address
   spaces.  In this scenario it is critical to segregate traffic among
   the different client subnets (or VPNs) in the data center.  As a
   result, a cloud data center may contain both a larger number of
   distinct Layer 2 segments and a higher host density within each
   Layer 2 segment.

   Cloud/Internet data centers have the following special properties:

   o  Massive number of hosts.

      Consider a typical tree-structured Layer 2 network, with one or
      two aggregation switches connected to a group of Top-of-Rack
      (ToR) switches and each ToR switch connected to a group of
      physical servers.  The number of servers connected in this
      network is limited by the port count of the ToR switches.  For
      example, if a ToR switch has 20 downstream ports, only 20
      servers or hosts can be connected to it.  If the aggregation
      switch has 256 ports connecting to ToR switches, up to
      20*256=5120 hosts can be connected to one aggregation switch
      when the servers are not virtualized.

      When servers are virtualized, one server can support tens or
      hundreds of VMs.  Hypothetically, if one server supports up to
      100 VMs, the same ToR and aggregation switches would need to
      support up to 512,000 hosts (see the illustrative calculation at
      the end of this section).  Even if the links have enough
      bandwidth to carry the traffic volume from all those VMs, other
      Layer 2 behaviors, such as frequent ARP broadcasts by hosts and
      flooding due to unknown destination addresses, create challenges
      for the network.

   o  Massive number of client subnets or Closed User Groups
      co-existing in the data center, with each subnet having its own
      IP address space.

      When a VPN (L2VPN or L3VPN) is extended to virtual machines
      residing in a service provider's data center, each VPN requires
      a unique subnet for its associated VMs in the data center.
      Given the large number of VPNs deployed today, such services can
      require a very large number of subnets to be supported by the
      data center.

   o  Hosts (VMs) migrate from one location to another.

      When a data center is virtualized, physical resources and
      logical hosts/content are decoupled.  An application can be
      loaded onto a virtual machine on any server, and can be migrated
      to a different location as part of the continuous process of
      minimizing the physical resources consumed in the data
      center(s).

      As discussed earlier, this migration requires the VMs to
      maintain the same IP and MAC addresses.  Their association with
      the corresponding subnet (or VPN) should not change either.
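   The host-count figures used in the first bullet above can be
   reproduced with the following short calculation.  This is an
   illustration only; the port counts and VM density are the
   hypothetical values used in the text, not measurements.

      # Reproduces the illustrative host-count arithmetic from the
      # first bullet above; all figures are hypothetical.
      TOR_DOWNSTREAM_PORTS = 20    # servers per ToR switch
      AGG_PORTS_TO_TOR     = 256   # ToR switches per aggregation switch
      VMS_PER_SERVER       = 100   # assumed virtualization density

      physical_hosts = TOR_DOWNSTREAM_PORTS * AGG_PORTS_TO_TOR
      virtual_hosts  = physical_hosts * VMS_PER_SERVER

      print(physical_hosts)   # 5120 hosts without virtualization
      print(virtual_hosts)    # 512000 hosts with 100 VMs per server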
5. ARP/ND Issues in the Data Center

   Traditional Layer 2 network designs place the hosts belonging to
   one subnet (or VLAN) close together, so that broadcast messages
   among hosts in the subnet are confined to the access switches.
   However, this kind of design puts many constraints on where VMs can
   be placed and can lead to very unbalanced utilization of data
   center resources.

   In a data center with virtualized servers, administrators may want
   to leverage the flexibility of server virtualization to place VMs
   in a way that satisfies the processing requirements of each VM
   while requiring the minimal number of physical servers and
   switching devices.  When those types of VM placement algorithms are
   used, hosts can be attached and re-attached at any location in the
   network.

   IPv4 hosts use ARP (Address Resolution Protocol [ARP]) to find the
   MAC address of a target host.  IPv4 ARP is a protocol that uses the
   Ethernet broadcast service to discover a host's MAC address from
   its IP address.  For host A to find the MAC address of a host B on
   the same subnet with IP address B-IP, host A broadcasts on its
   Ethernet interface an ARP query packet containing B-IP as well as
   its own IP and MAC addresses.  All hosts in the same subnet receive
   the packet.  Host B, whose IP address is B-IP, replies (via
   unicast) to inform A of its MAC address B-MAC.  A also records the
   mapping between B-IP and B-MAC.
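   To illustrate why each such query burdens the entire segment, the
   following sketch constructs the broadcast ARP request frame that
   host A emits, using only the Python standard library; the MAC and
   IP addresses are placeholders.

      # Builds the broadcast ARP request that "host A" sends to resolve
      # B-IP.  Standard-library Python only; addresses are placeholders.
      import socket
      import struct

      A_MAC = bytes.fromhex("020000000001")
      A_IP  = socket.inet_aton("192.0.2.1")   # host A (example range)
      B_IP  = socket.inet_aton("192.0.2.2")   # target host B
      BCAST = b"\xff" * 6                     # Ethernet broadcast
      ZERO  = b"\x00" * 6                     # unknown target MAC

      eth_header = BCAST + A_MAC + struct.pack("!H", 0x0806)  # EtherType ARP
      arp_body = struct.pack("!HHBBH6s4s6s4s",
                             1,       # hardware type: Ethernet
                             0x0800,  # protocol type: IPv4
                             6, 4,    # hardware/protocol address lengths
                             1,       # opcode 1: request ("who-has")
                             A_MAC, A_IP,   # sender MAC and IP
                             ZERO,  B_IP)   # target MAC (unknown) and IP

      frame = eth_header + arp_body
      assert len(frame) == 42   # every host on the segment receives and
                                # parses this frame

   Because the frame is sent to the broadcast address, every host and
   every switch port in the subnet must carry and process it, which is
   what makes large flat subnets expensive.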
   Even though all hosts cache the IP-to-MAC mappings they learn, so
   as to avoid repeating ARP broadcasts for the same target IP
   address, hosts age out their learned mappings quite frequently.
   For Microsoft Windows (versions XP and Server 2003), the default
   ARP cache policy is to discard entries that have not been used in
   at least two minutes, and for cache entries that are in use, to
   retransmit an ARP request every 10 minutes [Microsoft Windows].
   Hosts therefore send ARP requests very frequently.

   In addition to the broadcast messages sent by hosts, Layer 2
   switches also flood received data frames when the destination MAC
   address is unknown.

   Flooding and broadcast have worked well in the past, when the hosts
   belonging to one subnet (or VLAN) were placed close together.  A
   common practice is to limit the number of hosts in one subnet to
   fewer than 200, so that broadcast storms and flooding are
   restricted to a small domain, with all the hosts confined to a
   small number of ports on the access switches.  When subnets span
   multiple ToR or EoR switches, however, each subnet's broadcast and
   flooded traffic is exposed to the backbone links and switches of
   the entire data center network, and the network experiences
   problems similar to those of one big, flat Layer 2 network.  With
   the large number of hosts in a data center with virtualized
   servers, broadcast messages and flooding over the backbone links
   can consume an enormous amount of bandwidth.

   As indicated in [Scaling Ethernet], Carnegie Mellon studied the
   number of ARP queries received at a workstation on CMU's School of
   Computer Science LAN over a 12-hour period on August 9, 2004.  At
   peak, the host received 1150 ARPs per second; on average, it
   received 89 ARPs per second.  During the data collection, 2,456
   hosts were observed sending ARP queries.  The report expects the
   amount of ARP traffic to scale linearly with the number of hosts on
   the LAN.  For 1 million hosts, the extrapolation is 468,240 ARPs
   per second, or 239 Mbps of ARP traffic at peak, which is more than
   enough to overwhelm a standard 100 Mbps LAN connection.  Even
   ignoring the link capacity, forcing servers to inspect an extra
   half-million ARP packets per second would impose a prohibitive
   computational burden.
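   The extrapolation above can be reproduced with the short
   calculation below.  The 64-byte minimum Ethernet frame size used to
   convert packets per second into bits per second is an assumption of
   this illustration rather than a figure taken from the report; it
   matches the quoted 239 Mbps to within rounding.

      # Reproduces the linear extrapolation from the CMU measurement
      # cited above.  The 64-byte frame size is our assumption for
      # converting packets/second into bits/second.
      PEAK_ARPS_PER_SEC = 1150       # measured peak at the workstation
      OBSERVED_HOSTS    = 2456       # hosts seen sending ARP queries
      TARGET_HOSTS      = 1_000_000
      ARP_FRAME_BYTES   = 64         # minimum-size Ethernet frame (assumed)

      scaled_arps = PEAK_ARPS_PER_SEC / OBSERVED_HOSTS * TARGET_HOSTS
      arp_mbps    = scaled_arps * ARP_FRAME_BYTES * 8 / 1e6

      print(round(scaled_arps))   # ~468,241 ARPs per second
      print(round(arp_mbps))      # ~240 Mbps of ARP traffic at peak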
6. ARP and VM Migration

   In general, there is more flooding and more ARP traffic when VMs
   migrate.  VM migration in a Layer 2 environment requires updating
   the Layer 2 (MAC) FDBs of the individual switches in the data
   center to ensure accurate forwarding.  Consider a case where a VM
   migrates across racks.  The migrated VM typically sends out a
   gratuitous ARP broadcast [Gratuitous ARP] when it comes up at the
   new location, which the ToR switch of the new rack floods to the
   entire network.  The ToR switch of the old rack is not aware of the
   migration until it receives this gratuitous ARP, so it continues to
   forward frames to the port where it previously learned the VM's MAC
   address, black-holing that traffic.  The duration of this black-
   holing period depends on the topology; it may be longer if the VM
   has moved to a rack in a different data center connected to this
   one over Layer 2.

   During transition periods, some hosts might be temporarily taken
   out of service.  ARP request broadcasts for those hosts are then
   retransmitted repeatedly by other hosts.  Since there is no
   response from the target hosts, switches never learn a path toward
   them, and these ARP messages keep being flooded across the network.

   Simple VLAN partitioning is no longer enough to segregate traffic
   among tens of thousands of subnets (or Closed User Groups) within a
   data center.  Some type of encapsulation, such as MAC-in-MAC or
   TRILL, has to be used to further isolate the traffic belonging to
   different subnets.  When encapsulation is performed by the ToR
   switch, VM migration can cause even more broadcast messages and
   more flooded data frames, because the new ToR switch does not know
   the destination address to place in the outer header of the
   encapsulation.

   Therefore, it is critical to have some type of ARP optimization or
   extended ARP reply for ToR switches that perform the encapsulation.
   This can involve learning the target ToR address, so that the
   amount of flooding among ToR switches due to unknown destinations
   can be dramatically reduced.

7. Limitations of VLANs/Smaller Subnets in the Cloud Data Center

   Large data centers might need to support more than 4095 subnets or
   VLANs.  Simple VLAN partitioning is therefore no longer enough to
   segregate traffic among all those subnets; to enforce traffic
   segregation among them, some type of encapsulation has to be
   implemented.

   As a result of continuous VM migration, the hosts in one subnet
   (VLAN) may start out close together and gradually become relocated
   to various places.

   When one physical server supports more than 100 virtual machines,
   i.e., more than 100 hosts, it may start out serving hosts belonging
   to a small number of VLANs.  But gradually, as VM migration
   proceeds, hosts belonging to many different VLANs may end up being
   loaded onto VMs on this server.  Consider a case where 50 subnets
   (VLANs) are enabled on the switch port toward the server: the
   server has to handle the ARP broadcast messages of all 50 subnets
   (VLANs), which is still far too much ARP traffic for each server to
   process.

8. Why the IETF Needs to Develop Solutions Instead of IEEE 802

   ARP involves IP-to-MAC mapping, which has traditionally been
   standardized by the IETF, e.g., in RFC 826 [ARP].

9. Conclusion and Recommendation

   When there are tens of thousands of VMs in one data center, or in
   multiple data centers interconnected by a common Layer 2 network,
   address resolution has to be enhanced to support large-scale data
   centers and service agility.

   Therefore, we recommend that the IETF engage in the study of this
   address resolution scale problem and, if appropriate, the
   development of interoperable solutions for address resolution in
   massive data center networks.

10. Manageability Considerations

   This document does not add additional manageability
   considerations.

11. Security Considerations

   This document discusses a number of topics with their own security
   concerns, such as address resolution mechanisms (including ARP
   and/or ND) and multi-tenant data center networks, but it creates no
   additional requirements for security.

12. IANA Considerations

   This document creates no additional IANA considerations.

13. Acknowledgments

   Many thanks to T. Sridhar for his contributions to the text.

14. Informative References

   [ARP]               Plummer, D., "An Ethernet Address Resolution
                       Protocol", RFC 826, November 1982.

   [Microsoft Windows] Microsoft, "Microsoft Windows Server 2003
                       TCP/IP Implementation Details",
                       http://www.microsoft.com/technet/prodtechnol/
                       windowsserver2003/technologies/networking/
                       tcpip03.mspx, June 2003.

   [Scaling Ethernet]  Myers, et al., "Rethinking the Service Model:
                       Scaling Ethernet to a Million Nodes", Carnegie
                       Mellon University and Rice University.

   [Cost of a Cloud]   Greenberg, et al., "The Cost of a Cloud:
                       Research Problems in Data Center Networks".

   [Gratuitous ARP]    Cheshire, S., "IPv4 Address Conflict
                       Detection", RFC 5227, July 2008.

Authors' Addresses

   Linda Dunbar
   Huawei Technologies
   1700 Alma Drive, Suite 500
   Plano, TX 75075, USA
   Phone: (972) 543 5849
   Email: ldunbar@huawei.com

   Sue Hares
   Huawei Technologies
   2330 Central Expressway,
   Santa Clara, CA 95050, USA
   Email: shares@huawei.com

   Murari Sridharan
   Microsoft Corporation
   Email: muraris@microsoft.com

   Narasimhan Venkataramaiah
   Microsoft Corporation
   One Microsoft Way
   Redmond, WA 98052-6399, USA
   Phone: 425-707-4328
   Email: narave@microsoft.com

   Benson Schliesser
   Cisco Systems, Inc.
   Email: bschlies@cisco.com

Intellectual Property Statement

   The IETF Trust takes no position regarding the validity or scope of
   any Intellectual Property Rights or other rights that might be
   claimed to pertain to the implementation or use of the technology
   described in any IETF Document or the extent to which any license
   under such rights might or might not be available; nor does it
   represent that it has made any independent effort to identify any
   such rights.
   Copies of Intellectual Property disclosures made to the IETF
   Secretariat and any assurances of licenses to be made available, or
   the result of an attempt made to obtain a general license or
   permission for the use of such proprietary rights by implementers
   or users of this specification can be obtained from the IETF on-
   line IPR repository at http://www.ietf.org/ipr

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   any standard or specification contained in an IETF Document.
   Please address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

   All IETF Documents and the information contained therein are
   provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION
   HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET
   SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE
   DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
   LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION THEREIN
   WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.