idnits 2.17.1

draft-armd-datacenter-reference-arch-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (October 24, 2011) is 4540 days in the past.  Is this
     intentional?

  Checking references for intended status: None
  ----------------------------------------------------------------------------

  == Unused Reference: 'ARP' is defined on line 386, but no explicit
     reference was found in the text

  == Unused Reference: 'ND' is defined on line 389, but no explicit
     reference was found in the text

  == Unused Reference: 'STUDY' is defined on line 392, but no explicit
     reference was found in the text

  == Unused Reference: 'DATA1' is defined on line 397, but no explicit
     reference was found in the text

  == Unused Reference: 'DATA2' is defined on line 402, but no explicit
     reference was found in the text

     Summary: 1 error (**), 0 flaws (~~), 7 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

ARMD Working Group                                             M. Karir
Internet Draft                                       Merit Network Inc.
Intended status: Informational                                  Ian Foo
Expires: January 2012                               Huawei Technologies

                                                        October 24, 2011

                   Data Center Reference Architectures
                draft-armd-datacenter-reference-arch-01.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on January 24, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Abstract

   The continued growth of large-scale data centers has resulted in a
   wide range of architectures and designs.  Each design is tuned to
   address the challenges and requirements of the specific applications
   and workloads that the data center is being built for, and each
   design evolves as engineering solutions are developed to work around
   limitations of existing protocols, hardware, and software
   implementations.

   The goal of this document is to characterize this problem space in
   detail in order to better understand whether there are any gaps in
   making address resolution scale in the various network designs used
   in data centers.  In particular, our goal is to peel back the
   various optimizations and engineering solutions in order to develop
   generalized reference architectures for a data center.  We also
   discuss the factors that influence design choices in different data
   center designs.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

Table of Contents

   1. Introduction
   2. Terminology
   3. Generalized Data Center Design
      3.1. Access Layer
      3.2. Aggregation Layer
      3.3. Core
      3.4. Layer 3 / Layer 2 Topological Variations
         3.4.1. Layer 3 to Access Switches
         3.4.2. Layer 3 to Aggregation Switches
         3.4.3. Layer 3 in the Core Only
         3.4.4. Overlays
   4. Factors that Affect Data Center Design
      4.1. Traffic Patterns
      4.2. Virtualization
      4.3. Impact of Data Center Design on L2/L3 Protocols
   5. Conclusion and Recommendation
   6. Manageability Considerations
   7. Security Considerations
   8. IANA Considerations
   9. Acknowledgments
   10. References
   Authors' Addresses
   Intellectual Property Statement
   Disclaimer of Validity

1. Introduction

   Data centers are a key part of delivering Internet-scale
   applications.  Data center design and network architecture are
   important aspects of the overall service delivery plan.  This
   includes not only determining the scale of physical and virtual
   servers but also optimizing the entire data center stack, in
   particular the Layer 3 and Layer 2 architectures.  Depending on the
   particular application requirements and scale, data centers can be
   designed in a variety of ways.  Each design often reflects which
   aspects of the problem were, and were not, relevant to the purpose
   of that data center.  In this document we attempt to generalize the
   various design optimizations into a common generic architecture in
   order to facilitate the discussion of potential issues under a
   common framework.
2. Terminology

   ARP:     Address Resolution Protocol.

   ND:      Neighbor Discovery.

   Host:    An application running on a physical server or a virtual
            machine.  A host usually has at least one IP address and
            at least one MAC address.

   Server:  A physical computing machine.

   ToR:     Top of Rack switch.

   EoR:     End of Row switch.

   VM:      Virtual Machine.  Each server can support multiple VMs.

3. Generalized Data Center Design

   There are many different ways in which data centers might be
   designed.  The designs are usually engineered to suit the particular
   applications deployed in the data center.  For example, a massive
   web server farm might be engineered very differently from a
   general-purpose multi-tenant cloud hosting service.  However, in
   most cases the designs can be abstracted into a typical three-layer
   model consisting of the Access Layer, the Aggregation Layer and the
   Core.  The access layer generally refers to the Layer 2 switches
   that are closest to the physical or virtual servers; the aggregation
   layer sits at the Layer 2/Layer 3 boundary; and the core switches
   connect the aggregation switches to the larger network core.
   Figure 1 shows a generalized data center design, which captures the
   essential elements of the various alternatives.

            +-----+-----+            +-----+-----+
            |   Core0   |            |   Core1   |      Core
            +-----+-----+            +-----+-----+
               /     \                 /      /
              /       \---------------/---\  /
             /         /--------------/    \/
            /         /                     \
        +-------+    /                 +------+
       +/------+|   /                 +/-----+|
       | Aggr11|+ -/------------------|AggrN1|+         Aggregation Layer
       +---+---+/                     +------+/
          /     \                       /    \
         /       \                     /      \
      +---+     +---+               +---+    +---+
      |T11| ... |T1x|               |T21| ...|T2y|      Access Layer
      +---+     +---+               +---+    +---+
        |         |                   |        |
      +---+     +---+               +---+    +---+
      |   | ... |   |               |   | ...|   |
      +---+     +---+               +---+    +---+      Server racks
      |   | ... |   |               |   | ...|   |
      +---+     +---+               +---+    +---+
      |   | ... |   |               |   | ...|   |
      +---+     +---+               +---+    +---+

            Figure 1: Typical Layered Architecture in DC

3.1. Access Layer

   The access switches provide connectivity directly to and from the
   physical and virtual servers.  The access switches may be deployed
   in either a top-of-rack (ToR) or an end-of-row (EoR) physical
   configuration.  A server rack may have a single uplink to one access
   switch, or dual uplinks to two different access switches.

3.2. Aggregation Layer

   In a typical data center, aggregation switches interconnect many ToR
   switches.  Usually there are multiple parallel aggregation switches
   serving the same group of ToRs to achieve load sharing.  It is no
   longer uncommon to see aggregation switches interconnecting hundreds
   of ToR switches in large data centers.
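   To put this fan-out in perspective, the short Python sketch below
   estimates how many hosts, and therefore how many potential ARP/ND-
   resolvable addresses, can sit below a single aggregation pair.  The
   rack, server, and VM densities are assumed values chosen purely for
   illustration; they are not taken from any particular deployment.

      # Back-of-the-envelope estimate of how many hosts can sit below
      # one aggregation pair.  All densities are assumed values.

      def hosts_below_aggregation(tor_switches: int,
                                  servers_per_rack: int,
                                  vms_per_server: int) -> int:
          """Each ToR serves one rack; every VM is a host with at
          least one IP address and one MAC address."""
          return tor_switches * servers_per_rack * vms_per_server

      if __name__ == "__main__":
          for tors in (48, 200, 500):
              hosts = hosts_below_aggregation(tors,
                                              servers_per_rack=40,
                                              vms_per_server=20)
              print(f"{tors:4d} ToRs -> ~{hosts:,} hosts "
                    f"(potential ARP/ND entries)")

   Under these assumed densities, a few hundred ToRs behind one
   aggregation pair already corresponds to hundreds of thousands of
   hosts, which frames the address resolution scaling discussion in
   the following sections.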
3.3. Core

   Core switches interconnect multiple aggregation switches and act as
   the data center's gateway to external networks; they may also
   interconnect different PODs within one data center.

3.4. Layer 3 / Layer 2 Topological Variations

3.4.1. Layer 3 to Access Switches

   In this scenario the Layer 3 domain is extended all the way to the
   access switches.  Each rack enclosure is a single Layer 2 domain,
   which is confined to the rack.  In general there are no significant
   ARP/ND scaling issues in this scenario, as the Layer 2 domain cannot
   grow very large.  This topology is well suited to scenarios where
   servers (or VMs) under one access switch do not need to be reloaded
   with applications using different IP addresses and hosts do not need
   to move to racks under different access switches.  A small server
   farm or a very static compute cluster might be best served by this
   design.

3.4.2. Layer 3 to Aggregation Switches

   When the Layer 3 domain extends only to the aggregation switches,
   hosts in any of the IP subnets configured on the aggregation
   switches can be reached at Layer 2 through any access switch,
   provided the access switches enable all the VLANs.  This topology
   allows a great deal of flexibility: servers attached to one access
   switch can be reloaded with applications using a different IP
   prefix, and VMs can migrate between racks without IP address
   changes.  The drawback of this design, however, is that multiple
   VLANs have to be enabled on all access switches and on all ports of
   the aggregation switches.  Even though Layer 2 traffic is still
   partitioned by VLANs, enabling all VLANs on all ports allows
   broadcast traffic on every VLAN to traverse all links and ports,
   which has the same effect as one big Layer 2 domain.  In addition,
   internal traffic may have to cross Layer 2 boundaries, resulting in
   significant ARP/ND load at the aggregation switches.  This design
   provides the best trade-off between flexibility and Layer 2 domain
   size.  A moderately sized data center might use this approach to
   provide high-availability services at a single location.

3.4.3. Layer 3 in the Core Only

   In some cases where a wider range of VM mobility is desired (i.e., a
   greater number of racks among which VMs can move without an IP
   address change), the Layer 3 routed domain may be terminated at the
   core routers themselves.  In this case VLANs can span multiple
   groups of aggregation switches, which allows hosts to move among a
   larger number of server racks without an IP address change.  This
   scenario results in the largest ARP/ND performance impact, as
   explained later.  A data center with very rapid workload shifting
   may consider this kind of design.
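   The three variations above differ mainly in how many hosts sit
   behind the L2/L3 boundary device.  The sketch below compares them
   using assumed rack counts, host densities, and an assumed cache-
   refresh interval; the numbers are illustrative only and are not
   drawn from the referenced designs or from measurements.

      # Rough comparison of ARP/ND load at the L2/L3 boundary for the
      # three variations.  Densities and the refresh interval are
      # illustrative assumptions.

      HOSTS_PER_RACK = 40 * 20        # 40 servers/rack, 20 VMs/server
      REFRESH_INTERVAL_S = 60         # assumed ARP/ND cache lifetime

      def boundary_load(racks_in_l2_domain: int):
          """Hosts whose ARP/ND state terminates at the boundary, and
          the steady-state request rate implied by cache refreshes."""
          hosts = racks_in_l2_domain * HOSTS_PER_RACK
          requests_per_s = hosts / REFRESH_INTERVAL_S
          return hosts, requests_per_s

      for name, racks in [("Layer 3 at access (one rack)",        1),
                          ("Layer 3 at aggregation (300 racks)",  300),
                          ("Layer 3 in core only (1200 racks)",   1200)]:
          hosts, rate = boundary_load(racks)
          print(f"{name}: {hosts:,} hosts, "
                f"~{rate:,.0f} ARP/ND requests/s")

   Even with these modest assumptions, moving the Layer 3 boundary
   from the access switch to the core increases the resolution state
   and request load at the boundary by several orders of magnitude.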
3.4.4. Overlays

   There are several approaches by which overlay networks can make a
   very large Layer 2 network scale and enable mobility.  Overlay
   networks built using various Layer 2 or Layer 3 mechanisms allow the
   interior switches and routers to remain unaware of the hosts'
   addresses.  The overlay edge switches and routers, which perform the
   network address encapsulation and decapsulation, still see host
   addresses.

   When a large data center has tens of thousands of applications that
   communicate with peers in different subnets, all of those
   applications send (and receive) data packets through their L2/L3
   boundary nodes whenever the targets are in different subnets.  The
   L2/L3 boundary nodes have to process the ARP/ND requests sent from
   the originating subnets and resolve the physical (MAC) addresses in
   the target subnets.  In order to allow a great number of VMs to move
   freely within a data center without reconfiguring their IP
   addresses, those VMs need to sit under common gateway routers, which
   means the common gateway has to handle address resolution for all of
   those hosts.  The use of overlays in the data center network can
   therefore be a useful design mechanism for managing the potential
   bottleneck at the Layer 2/Layer 3 boundary by redefining where that
   boundary exists.

4. Factors that Affect Data Center Design

4.1. Traffic Patterns

   Expected traffic patterns play an important role in designing
   appropriately sized Access, Aggregation and Core networks.  Traffic
   patterns also vary based on the expected use of the data center.
   Broadly speaking, it is desirable to keep as much traffic as
   possible within the Access Layer in order to minimize bandwidth
   usage at the Aggregation Layer.  If the expected use of the data
   center is to serve as a large web server farm, where thousands of
   nodes are doing similar things and the traffic pattern is largely
   in/out, a large access layer built with EoR switches might be the
   most useful, as it minimizes complexity, allows servers and
   databases to be located in the same Layer 2 domain, and provides
   maximum density.

   A data center that is expected to host a multi-tenant cloud hosting
   service might have completely different requirements: in order to
   isolate inter-customer traffic, smaller Layer 2 domains are
   preferred, and although the size of the overall data center might be
   comparable to the previous example, the multi-tenant nature of the
   cloud hosting application requires a smaller, more compartmentalized
   access layer.  A multi-tenant environment might also require the use
   of Layer 3 all the way to the Access Layer ToR switch.

   Yet another example of an application with a unique traffic pattern
   is a high-performance compute cluster, where most of the traffic is
   expected to stay within the cluster but there is at the same time a
   high degree of crosstalk between the nodes.  This again calls for a
   large Access Layer in order to minimize the requirements at the
   Aggregation Layer.

4.2. Virtualization

   Using virtualization in the data center further increases the
   densities that can be achieved.  Virtualization also complicates the
   requirements on the Access Layer, as the Access Layer determines the
   scope of server migration and of failover when physical hardware
   fails.

   Virtualization can also place additional requirements on the
   aggregation switches in terms of address resolution table size and
   the scalability of any address learning protocols that might be used
   on those switches.  The use of virtualization often also requires
   additional VLANs for high-availability beaconing, which would need
   to span the entire virtualized infrastructure.  This in turn
   requires the Access Layer to span as widely as the virtualized
   infrastructure.
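   As a rough illustration of the table-size pressure that
   virtualization adds, the sketch below compares the address table
   requirements of a bare-metal deployment with those of the same
   physical footprint after virtualization.  All densities and per-host
   address counts are assumptions chosen for illustration, not vendor
   limits or measured figures.

      # Sizing sketch for aggregation-layer address tables.  All
      # numbers are illustrative assumptions.

      def table_entries(racks, servers_per_rack, vms_per_server,
                        addrs_per_host=1):
          """MAC/ARP/ND entries needed when the Layer 2 domain spans
          every rack below an aggregation switch."""
          return racks * servers_per_rack * vms_per_server * addrs_per_host

      bare_metal  = table_entries(300, 40, 1)
      virtualized = table_entries(300, 40, 20)
      print(f"bare metal : ~{bare_metal:,} entries")
      print(f"virtualized: ~{virtualized:,} entries "
            f"({virtualized // bare_metal}x growth from "
            f"virtualization alone)")

   The point of the sketch is simply that virtualization multiplies the
   address state roughly by the VM density, before any additional
   per-VM addresses are considered.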
4.3. Impact of Data Center Design on L2/L3 Protocols

   When an L2/L3 boundary router receives data packets on its L3
   interfaces destined to hosts in its L2 domain, and the target
   address is not present in the router's ARP/ND cache, it usually
   holds the data packets and initiates ARP/ND requests into its L2
   domain to make sure the target actually exists before forwarding the
   data packets to it.  If no response is received, the router has to
   resend the ARP/ND request multiple times.  If no response is
   received after X ARP/ND requests, the router has to drop all of
   those data packets.  This process can be very CPU intensive.

   When a local host in the L2/L3 router's L2 domain needs to send a
   data frame to external peers, it usually sends ARP/ND requests to
   obtain the physical (i.e., MAC) address of the L2/L3 router.  Many
   hosts repeatedly send ARP/ND requests to their default L3 gateway
   routers to refresh their ARP/ND caches.  The default routers
   therefore have to process a great number of ARP/ND requests when the
   number of hosts in their L2 domains is very large.  For IPv4, this
   pain point can be mitigated by having the gateway routers frequently
   send gratuitous ARP messages so that all hosts in the L2 domain
   refresh their ARP cache entries for the default gateway's MAC
   address.  For IPv6, however, hosts need to validate bidirectional
   communication with the gateway router before sending any data
   frames; therefore, unsolicited neighbor advertisements from the
   gateway router cannot prevent hosts from repeatedly sending ND
   messages.

   When hosts in two different subnets under the same L2/L3 boundary
   router need to communicate with each other, the L2/L3 router not
   only has to initiate ARP/ND requests toward the target's subnet, it
   also has to process the ARP/ND requests from the originating subnet.
   This process is even more CPU intensive.
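   The hold-and-resolve behavior described above can be summarized in a
   deliberately simplified model.  The sketch below is illustrative
   only: the class and method names, the retry limit, and the timer
   handling are assumptions made for this example, and real routers
   implement this logic in their forwarding and control planes rather
   than in software of this kind.

      # Simplified model of hold-and-resolve at an L2/L3 boundary.
      from collections import defaultdict

      MAX_RESOLVE_ATTEMPTS = 3        # the "X" in the text above

      class ResolutionQueue:
          def __init__(self):
              self.arp_cache = {}                 # target IP -> MAC
              self.pending = defaultdict(list)    # target IP -> held packets
              self.attempts = defaultdict(int)    # target IP -> requests sent

          def receive(self, target_ip, packet):
              """Forward immediately if resolved; otherwise hold the
              packet and, on the first miss, solicit the target."""
              if target_ip in self.arp_cache:
                  return ("forward", self.arp_cache[target_ip], [packet])
              self.pending[target_ip].append(packet)
              if self.attempts[target_ip] == 0:
                  return self.retry_timer_expired(target_ip)
              return ("held", target_ip, [])

          def retry_timer_expired(self, target_ip):
              """Resend the ARP/ND request, or drop the held packets
              once the retry limit is reached."""
              if self.attempts[target_ip] >= MAX_RESOLVE_ATTEMPTS:
                  dropped = self.pending.pop(target_ip, [])
                  del self.attempts[target_ip]
                  return ("drop", target_ip, dropped)
              self.attempts[target_ip] += 1
              return ("send_arp_or_ns", target_ip, [])

          def reply_received(self, target_ip, mac):
              """ARP reply or neighbor advertisement received: cache
              the binding and release any held packets."""
              self.arp_cache[target_ip] = mac
              self.attempts.pop(target_ip, None)
              return ("forward", mac, self.pending.pop(target_ip, []))

   Every held packet, retransmission, and cache update in this model
   corresponds to control-plane work on the boundary router, which is
   why large Layer 2 domains concentrate CPU load there.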
5. Conclusion and Recommendation

   In this document we have described a generalized data center network
   design.  Our goal is to distill the essence of different designs
   into a common framework in order to structure the discussion of the
   various scaling issues that may appear in different scenarios.
   Different application needs, such as traffic patterns, and the role
   for which the data center is being designed drive the design
   choices, which in turn lead to different scaling issues with regard
   to port density, ARP/ND, VM mobility, and performance.  As expected,
   engineering solutions serve to tune a given design to the particular
   needs of the data center at the expense of other factors.

6. Manageability Considerations

   This document does not add any additional manageability
   considerations.

7. Security Considerations

   This document has no additional requirements for security.

8. IANA Considerations

   None.

9. Acknowledgments

   We want to acknowledge the following people for their valuable
   discussions related to this draft: Kyle Creyts, Alexander Welch and
   Michael Milliken.

   This document was prepared using 2-Word-v2.0.template.dot.

10. References

   [ARP]    Plummer, D., "An Ethernet Address Resolution Protocol",
            RFC 826, November 1982.

   [ND]     Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
            "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
            September 2007.

   [STUDY]  Rees, J. and M. Karir, "ARP Traffic Study", NANOG 52,
            June 2011,
            http://www.nanog.org/meetings/nanog52/presentations/Tuesday/Karir-4-ARP-Study-Merit Network.pdf

   [DATA1]  Cisco Systems, "Data Center Design - IP Infrastructure",
            October 2009,
            http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DC_3_0/DC-3_0_IPInfra.html

   [DATA2]  Juniper Networks, "Government Data Center Network Reference
            Architecture", 2010,
            www.juniper.net/us/en/local/pdf/reference-architectures/8030004-en.pdf

Authors' Addresses

   Manish Karir
   Merit Network Inc.
   1000 Oakbrook Dr, Suite 200
   Ann Arbor, MI 48104, USA
   Phone: 734-527-5750
   Email: mkarir@merit.edu

   Ian Foo
   Huawei Technologies
   2330 Central Expressway
   Santa Clara, CA 95050, USA
   Phone: 919-747-9324
   Email: Ian.Foo@huawei.com

Intellectual Property Statement

   The IETF Trust takes no position regarding the validity or scope of
   any Intellectual Property Rights or other rights that might be
   claimed to pertain to the implementation or use of the technology
   described in any IETF Document or the extent to which any license
   under such rights might or might not be available; nor does it
   represent that it has made any independent effort to identify any
   such rights.

   Copies of Intellectual Property disclosures made to the IETF
   Secretariat and any assurances of licenses to be made available, or
   the result of an attempt made to obtain a general license or
   permission for the use of such proprietary rights by implementers or
   users of this specification can be obtained from the IETF on-line
   IPR repository at http://www.ietf.org/ipr

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   any standard or specification contained in an IETF Document.  Please
   address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

   All IETF Documents and the information contained therein are
   provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION
   HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY,
   THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
   WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
   WARRANTY THAT THE USE OF THE INFORMATION THEREIN WILL NOT INFRINGE
   ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
   FOR A PARTICULAR PURPOSE.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.