Internet Engineering Task Force                            M. Scott, Ed.
Internet-Draft                                            D. Wagner-Hall
Intended status: Informational                              J. Crowcroft
Expires: April 21, 2011                          University of Cambridge
                                                        October 18, 2010


           Addressing the Scalability of Ethernet with MOOSE
                        draft-malc-armd-moose-00

Abstract

   Ethernet does not scale well to large networks.
   The flat MAC address space, whilst having obvious benefits for the
   user and administrator, is the primary cause of this poor
   scalability; other recent efforts to improve upon Ethernet's
   scalability have addressed symptoms, rather than this underlying
   cause. MOOSE, Multi-level Origin-Organised Scalable Ethernet, is an
   Ethernet switch architecture that performs in-place rewriting of MAC
   addresses in order to impose a hierarchy upon the address space
   without reconfiguration or modification of connected devices. This
   removes the need for switches to maintain large forwarding
   databases, is of direct use in implementing improved routing, and
   allows for a variety of other scalability and security innovations.
   MOOSE also includes a globally-scalable, distributed and resilient
   protocol for the automatic assignment of addresses to switches, and
   for detecting and cheaply resolving addressing conflicts.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 21, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document. Code
   Components extracted from this document must include Simplified BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Ethernet's Underlying Problem
   3.  Related Work
   4.  MOOSE Architecture
     4.1.  Shortest Path Routing
     4.2.  Address Selection and Conflict Resolution
     4.3.  Broadcast and Multicast
     4.4.  Example
     4.5.  Directory Service
     4.6.  Mobility
   5.  Interoperability Considerations
     5.1.  Layer-violating Protocols
     5.2.  Edge Virtual Bridging
   6.  Prototype Implementation
   7.  Conclusions
   8.  IANA Considerations
   9.  Security Considerations
   10. Informative References
   Authors' Addresses

1.  Introduction

   Ethernet has lasted well since its inception in the '70s with
   Ethernet frame-structure and addressing remaining ubiquitous in the
   data centre environment as in many others.
   Alongside IP and IP-transported services such as iSCSI, it is now
   commonplace to see converged network services such as physical disk
   interfaces and cluster interconnects layered directly over Ethernet
   (e.g. ATA-over-Ethernet and variants of Infiniband). However,
   Ethernet exhibits scalability issues on networks of more than a few
   thousand devices, such as costly and energy-dense address table
   logic and storms of broadcast traffic.

   Aside from more physical devices, virtualised infrastructure further
   increases the density of Ethernet addresses in data centres.
   Widely-used layer-2 virtualisation [Cl05] mandates a unique Ethernet
   address per virtual machine. This means that each physical machine
   in a data centre may represent many tens of Ethernet devices.

   The traditional method of avoiding such problems is the artificial
   subdivision of a network, but this introduces an administrative
   burden, requires significant routing equipment and also precludes
   seamless migration--a necessity for virtualised infrastructure.
   While IP Mobility [RFC3344] addresses the problem of maintaining
   higher-layer connections when roaming between subnets, it requires
   client support that is neither ubiquitous nor reliable. Common
   practice sees the provision of one physical Ethernet network covering
   an entire data centre, or even an entire WAN of data centres.

   Our approach, Multi-level Origin-Organised Scalable Ethernet (MOOSE),
   provides all the advantages of an Ethernet network without the
   capital and running costs and administrative overhead of an IP
   router-based approach. MOOSE does this by providing a hierarchical
   addressing scheme without requiring host reconfiguration or
   modification.

   Ethernet's scalability is limited firstly by the forwarding database
   that every switch in an Ethernet [802.1D] network must maintain.
   A switch's forwarding database contains one entry per source address
   seen in any frame passing through that switch, and stores that MAC
   address together with the learnt location of that address--the port
   on which packets from that address were last seen. This is later
   used to determine on which port to transmit frames destined for that
   address. Devices frequently broadcast frames throughout the network
   (e.g. ARP queries) so active devices on the network are listed in
   most switches' forwarding databases most of the time.

   In modern switches the capacity of this database is generally of the
   order of 16,000 entries. (Higher-capacity forwarding databases exist
   but are currently constrained to very high-end switches.) On a
   moderately large network, full databases are a serious risk. If the
   database becomes full, entries will be discarded; frames for unknown
   addresses are flooded to all ports and the resulting traffic storm
   could cause major problems, especially in the presence of
   low-capacity edge links.

   Traditionally the forwarding database has been stored in a
   content-addressable memory (CAM) as lookups must be very fast,
   particularly as 10 Gbit/s Ethernet becomes ubiquitous. As networks
   grow, the number of entries in a switch's forwarding database must
   naturally increase; however, increasing the capacity of CAMs without
   sacrificing speed whilst constraining energy consumption is proving
   to be challenging. Cheaper switches use DRAM in place of a CAM, but
   this is likely to remain slower especially for large tables.

   Secondly, Ethernet's inability to handle networks containing loops
   also presents a scalability problem. The Rapid Spanning Tree
   Protocol, RSTP, must remove loops by disabling any redundant links.
   On a dense mesh network, RSTP will disable a large proportion of
   links; this constrains frames to suboptimal routes and may introduce
   bottlenecks in the network, particularly around the root of the
   spanning tree. In a data centre environment, this potentially
   amounts to a very large proportion of capacity being wasted wherever
   redundant fibres are installed, e.g. between cabinet switches and
   between data centres.

   Thirdly, not only does Ethernet flood frames destined for unknown
   hosts, but it also uses--and encourages higher-layer protocols to
   use--broadcast for control messages. For example, ARP [RFC0826]
   performs address resolution via broadcast queries, and DHCP [RFC2131]
   uses broadcast messages for automatic configuration. It is
   impractical to replace these protocols entirely as this would require
   software upgrades to every device, but it would be desirable for the
   network to minimise the amount of broadcast traffic required to be
   forwarded.

   In this document we identify the relevant underlying problems in the
   design of Ethernet, review previous work and present the MOOSE switch
   architecture, which addresses inadequacies in the fundamental
   operation of Ethernet in a novel yet backwards-compatible way. By
   revisiting the addressing scheme itself, rather than simply
   addressing symptoms of the problem as many previously proposed
   solutions have done, we can go about solving all of the above
   scalability problems and more.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

2.  Ethernet's Underlying Problem

   The original Ethernet was a shared-medium network, where every frame
   was broadcast and no switching took place.
   Modern-day wired Ethernet-based networks instead consist almost
   entirely of point-to-point links; as a result of this, the
   distinction between unicast, broadcast and multicast has become more
   important. 802.11 wireless LANs are the one remaining vestige of
   Ethernet operating over shared media, where one switch (access
   point) serves many hosts on the same radio channel.

   Ethernet's poor scalability arises in various guises, as outlined
   above. It would seem at first glance that these are entirely
   distinct and unrelated. However, there is a common underlying cause:
   that MAC addresses provide no location information.

   Globally-unique MAC addresses are structured such that the first
   three bytes of a device's address contain an organisationally unique
   identifier (OUI) allocated to the device's manufacturer by the IEEE,
   with the remaining three bytes allocated by the manufacturer. This
   hierarchy exists solely for the purpose of allocating unique
   addresses in a decentralised fashion, and is of no use to Ethernet
   switches, which must treat the unicast address space as flat.

   A flat address space has the advantage that no configuration of
   devices is required; a device can use its unique,
   manufacturer-assigned MAC address anywhere on any network. However,
   this leaves each switch with the task of discovering and storing the
   location of every addressable device.

   If the MAC address space were not flat, but instead contained enough
   information to locate the device possessing the address, several
   advantages would be gained. Firstly, large forwarding databases
   would no longer have to be maintained on every switch. This location
   information could instead be distributed across the network so that
   frames are directed towards their destinations according to
   successive stages of a hierarchy.
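
   To put rough numbers on the first of these advantages, the following
   Python sketch compares the number of forwarding entries a single
   switch would hold under flat and hierarchical addressing. It is
   illustrative only: the function names and the network dimensions
   (100 switches with 200 hosts each) are assumptions made for this
   example. The flat case assumes the worst case in which every switch
   has learnt every host address; the hierarchical case assumes one
   entry per remote switch plus one per directly-connected host.

```python
def flat_entries(switches: int, hosts_per_switch: int) -> int:
    """Worst-case entries per switch when every host address is learnt."""
    return switches * hosts_per_switch


def hierarchical_entries(switches: int, hosts_per_switch: int) -> int:
    """Entries per switch when only remote switches and local hosts are stored."""
    return (switches - 1) + hosts_per_switch


if __name__ == "__main__":
    switches, hosts = 100, 200  # assumed example network, not from the draft
    print("flat:", flat_entries(switches, hosts))
    print("hierarchical:", hierarchical_entries(switches, hosts))
```

   Under these assumptions the flat scheme needs 20,000 entries per
   switch, which exceeds the 16,000-entry capacity cited in the
   Introduction, whereas the hierarchical scheme needs 299.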

   Secondly, a hierarchical MAC address space would also make the
   addition of shortest-path routing considerably easier. Shortest-path
   routing is clearly a desirable property for a network, yet it is one
   that Ethernet does not provide. Flat addressing does not lend itself
   to easy routing: any address can be located anywhere on the network,
   which means either advertising every host's MAC address via the
   routing protocol--which scales very poorly--or providing some other
   location lookup service. The use of hierarchical addresses, with
   each switch handling a block of sequential addresses akin to an IP
   subnet, would reduce the routing problem to the one that routing
   protocols were designed to solve.

   Thirdly, this would allow for reduction of broadcast traffic in a
   variety of different ways. Hierarchical MAC addresses could, for
   example, be mapped directly and deterministically onto the IP address
   space, if appropriate for the specific deployment. This would allow
   switches to respond directly and simply to DHCP and ARP queries,
   avoiding the need to forward the most common sources of broadcast
   frames. Alternatively, a distributed directory service can be used,
   which is less limiting and is thus our preferred approach as detailed
   below.

   The facility for network administrators to assign locally
   administered addresses (LAAs) to devices has existed for as long as
   Ethernet. However, configuring and maintaining the LAA on every
   device based upon where they are connected would be a considerable
   and unwelcome administrative overhead. We therefore present MOOSE, a
   system for applying hierarchical addressing to an Ethernet
   transparently and without any configuration to edge devices.

3.  Related Work

   It is well-known that traditional Ethernet scales poorly, and there
   have been various attempts in recent years to rectify this.
   The most widely-used of these in real-world networks is MPLS-VPLS
   [RFC3031] (Multiprotocol Label Switching--Virtual Private LAN
   Service). This connects Ethernet islands together through tunnels
   across an MPLS cloud. MPLS works by adding one or more labels to the
   start of every frame, i.e. encapsulating the frame inside its own
   protocol.

   In MPLS-VPLS, the label edge routers (LERs) must determine the
   frame's initial label(s) based upon the destination address via a
   lookup table. Frames follow prenegotiated label-switched paths
   (LSPs) that, unlike Ethernet, are not constrained to follow a
   spanning tree; LSPs are precomputed at connection setup time and the
   relevant next hop is stored in a lookup table on each intermediate
   switch. Each switch must hence use each frame's label to index into
   this lookup table to determine how to switch the frame.

   The effect, once the connection has been negotiated, is to provide
   what appears to be one or more large Ethernet networks, transparently
   overlaid on the MPLS cloud. Whilst this effectively solves the
   problem of shortest-path routing across the MPLS cloud, the overlay
   Ethernets are still susceptible to the usual scalability problems--
   and in fact VPLS adds further large lookup tables on every switch
   that can in some configurations scale even worse than Ethernet's
   forwarding databases. LERs must map every MAC address to an LSP;
   label switch routers (LSRs) must store the next hop for every LSP in
   which they participate, which in the core of the network could scale
   as O(hosts^2).

   A similar scheme is proposed by Hadzic [Ha01], with the difference
   that Ethernet-inside-Ethernet encapsulation is used rather than a new
   protocol. This has the advantage that less processing is required on
   intermediate switches in the backbone network.
   However, routes across the backbone are constrained to a spanning
   tree, and encapsulating switches must obtain a new destination
   address for every frame using a lookup table that--like Ethernet's
   forwarding database--must contain every transmitting MAC address.
   Due to its heavy basis on Ethernet, this shares many of Ethernet's
   scalability problems.

   SmartBridge [Ro00] and RBridges [Pe04] (TRILL [RFC5556]) both
   encapsulate Ethernet frames in a new inter-switch protocol, and run a
   link-state routing protocol between switches. The link state graph
   includes the location of every MAC address--necessary because the
   address space remains flat and any address could appear
   anywhere--i.e. it again contains every host. Furthermore, switches
   must perform expensive computation to update routing tables whenever
   a MAC address joins or leaves the network.

   Myers et al [My04] suggest that Ethernet's main failing is its
   broadcast service, and propose a new architecture in which hosts make
   explicit use of directory services operated by switches rather than
   broadcasting queries. It is clear that switches' participation is
   necessary in order to deal with the broadcast problem; however the
   modifications to Ethernet suggested are not backwards-compatible and
   would require at least software modifications to all connected
   devices. Ethernet is, perhaps unfortunately, too widespread for this
   to be practical; transparent interception of broadcast frames and
   subsequent local handling or redirection via multicast or unicast
   remains the only practical solution. The use of hierarchical
   addressing is a useful stepping-stone to such a system, and our
   architecture includes a transparent directory service (ELK) for this
   purpose.

   SEATTLE [Ki08] takes a more scalable approach.
   A routing protocol is operated between switches, but in contrast to
   the approaches described above and in common with MOOSE, the routing
   protocol only propagates switch location information, rather than
   every MAC address on the network. Flat MAC addresses are still used,
   and hence a mechanism is required to look up the switch to which a
   given address is connected. This is achieved by using a distributed
   hash table (DHT) operating on participating switches with local
   caching to alleviate load. This is certainly a step in the right
   direction but introduces considerable complexity to switches, since
   they now must maintain and update the DHT continually, and it is
   clear that a SEATTLE switch would have a significant software
   component in the data path. MOOSE alleviates some of the complexity
   of SEATTLE by a combination of hierarchical addresses and delegation
   to a separate directory service.

4.  MOOSE Architecture

   The basic operation of MOOSE is to assign a new hierarchical MAC
   address to each host on the network, assigned dynamically and
   automatically from the unicast LAA space. This dynamically-assigned
   address is referred to as a MOOSE address to avoid confusion with
   hosts' static, manufacturer-assigned MAC addresses.

   Every frame entering the network has its source address rewritten
   in-place to the sending host's MOOSE address by the first MOOSE-aware
   switch it traverses. The switch that performs address rewriting for
   a host--i.e. the closest MOOSE switch to that host--is the host's
   home switch and is responsible for assigning a MOOSE address to that
   host. (If non-MOOSE switches or hubs are in use, a host may have
   more than one "closest" MOOSE switch, in which case an RSTP-like
   protocol must be used to elect a switch to handle each edge segment.)

   The destination address is left intact in the expectation that it
   already is a MOOSE address.
   Hosts' ARP caches will already contain the MOOSE addresses of any
   hosts being communicated with as any packet received will already
   have had its source address rewritten; a host's
   manufacturer-assigned MAC address is never seen outside of the
   segment containing that host. This is a crucial point since
   encapsulation-based technologies such as MPLS do not reveal to the
   destination host the address used for routing; as a result, switches
   in such networks must convert destination as well as source
   addresses of frames entering the network. In other words, once again
   switches must maintain large tables of remote hosts on the network.
   The only destination rewriting that MOOSE switches perform, however,
   is of the destination addresses of frames destined for local hosts
   back to their manufacturer-assigned MAC addresses; this is simple as
   the required information is already known, and necessary because
   otherwise that host's network interface card would discard the frame
   as misaddressed.

   A MOOSE address consists of a switch identifier followed by a host
   identifier. For our examples, we simply use a fixed three-byte
   switch identifier followed by a fixed three-byte host identifier:

      +----------+     +----------+
      |  switch  |_____|  switch  |_ _ _ _ hosts 02:22:22:00:00:01,
      | 02:11:11 |     | 02:22:22 |              02:22:22:00:00:02, etc.
      +----------+     +----------+
           |
           |
      +----------+
      |  switch  |_ _ _ _ hosts 02:33:33:00:00:01,
      | 02:33:33 |              02:33:33:00:00:02, etc.
      +----------+

   Since these two identifiers when concatenated must form a unicast
   LAA, the settings of two bits in the first byte of the switch
   identifier are fixed: the least significant bit must be 0 to indicate
   a unicast address, and the second-least significant bit must be 1 to
   indicate an LAA.
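
   The address format and the two rewriting steps described above can
   be sketched in Python as follows. This is a minimal illustration
   rather than a definitive implementation: the names `is_unicast_laa`,
   `make_moose_address` and `HomeSwitch` are hypothetical, the fixed
   three-byte/three-byte split is the one used in this document's
   examples, and host identifiers are handed out sequentially purely
   for brevity.

```python
UNICAST_BIT = 0x01  # least significant bit of the first byte; 0 = unicast
LAA_BIT = 0x02      # second-least significant bit; 1 = locally administered


def is_unicast_laa(addr: bytes) -> bool:
    """True if addr is a 6-byte unicast, locally administered address."""
    if len(addr) != 6:
        return False
    return (addr[0] & UNICAST_BIT) == 0 and (addr[0] & LAA_BIT) != 0


def make_moose_address(switch_id: bytes, host_id: bytes) -> bytes:
    """Concatenate a 3-byte switch identifier and a 3-byte host identifier."""
    if len(switch_id) != 3 or len(host_id) != 3:
        raise ValueError("this sketch assumes 3-byte switch and host identifiers")
    addr = switch_id + host_id
    if not is_unicast_laa(addr):
        raise ValueError("switch identifier must yield a unicast LAA")
    return addr


class HomeSwitch:
    """Hypothetical home switch that rewrites addresses at the network edge."""

    def __init__(self, switch_id: bytes) -> None:
        self.switch_id = switch_id
        self.moose_by_real = {}  # manufacturer-assigned MAC -> MOOSE address
        self.real_by_moose = {}  # MOOSE address -> manufacturer-assigned MAC

    def rewrite_ingress(self, src_real: bytes, dst: bytes):
        """Rewrite the source of a frame entering the network; dst is untouched."""
        moose = self.moose_by_real.get(src_real)
        if moose is None:
            # Sequential assignment, assumed here only to keep the sketch short.
            host_id = (len(self.moose_by_real) + 1).to_bytes(3, "big")
            moose = make_moose_address(self.switch_id, host_id)
            self.moose_by_real[src_real] = moose
            self.real_by_moose[moose] = src_real
        return moose, dst

    def rewrite_egress(self, src: bytes, dst_moose: bytes):
        """Restore the real MAC of a local host before final delivery."""
        return src, self.real_by_moose[dst_moose]
```

   For example, a switch created as
   `HomeSwitch(bytes.fromhex("022222"))` would give the first host it
   sees the MOOSE address 02:22:22:00:00:01, matching the figure above.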
   To cater for variable length switch identifiers, some means of
   introducing separation between the switch and host identifiers is
   required. Two possible implementations would be for:

   1.  the first three bits of the address to indicate how many of the
       following 5-bit blocks make up the switch prefix;

   2.  some constant delimiter to appear between the switch identifier
       and host identifier, with switch identifiers not allowed to
       contain the delimiter.

   The former is simple and gives eight classes of switch identifier.
   Because the size of a MOOSE network is limited by the placement of IP
   routers, these classes should be sufficient. Additionally, because
   switches are free to change their identifiers, they may trivially
   switch to a larger class if they have too many attached hosts, or if
   a smaller class becomes full.

   The latter removes the fixed classes, allowing for more flexibility
   with the sizes of switch identifiers, at the cost of complexity, and
   a reduction in the available address space.

   Each switch can select for itself a unique switch identifier, as
   identifier conflict resolution is cheap (see below). When first
   joining the routing protocol, conflict should be very unlikely, as
   the switch will in the process gain an up-to-date list of in-use
   identifiers. Depending on requirements, the switch identifier may
   itself be a hierarchical address--e.g. six bits to identify a network
   area followed by two bytes to identify a switch within that area--
   which could then be used to aid routing decisions.

   Each host is assigned a host identifier by its home switch from the
   pool of identifiers available to that switch. Only a host's home
   switch ever bases a switching decision on the host identifier, so the
   detail of how these are allocated can vary from switch to switch.
   Suitable schemes include:

   1.  sequential assignment;

   2.  the port number followed by a sequential portion (to allow for
       multiple hosts connected to one port);

   3.  a hash of the host's real MAC address.

   The latter two approaches are preferable to a simple sequential
   assignment, as they better isolate certain kinds of denial-of-service
   attack in which a malicious host attempts to use up all available
   host identifiers on the switch. They also require less state to be
   shared between ports. The third option has the further advantage
   that it is deterministic and hence can be recovered easily in the
   event of a crash.

   It is hence possible to route frames through the network to remote
   hosts by simply inspecting the switch identifier in the destination
   address, and ignoring the host identifier until the frame reaches the
   destination host's home switch. Switches no longer need to keep a
   table of all MAC addresses seen recently; they only need store the
   locations of other switches and of any directly-connected hosts.

   As well as reducing the amount of data that must be consulted in
   order to make switching decisions, this provides extra resilience by
   making this data much more predictable. The number of MAC addresses
   in a network can increase unexpectedly in the event of an address
   flooding attack or even under normal operation if the network
   contains open wireless access points; relying on the MAC address list
   for forwarding leads to some of the vulnerabilities of Ethernet. The
   set of switch identifiers participating in MOOSE switching, on the
   other hand, is kept predictable and manageable by ensuring that
   neighbouring switches (discovered using LLDP [802.1AB]) are
   authenticated before they can participate in the routing protocol.
   This authentication can be achieved at layer 3 using the security
   features found in most popular routing protocols and/or at layer 2
   [802.1X].
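
   The forwarding rule described earlier in this section, in which
   only the switch identifier is consulted until a frame reaches the
   destination's home switch, can be sketched in Python as follows,
   together with the hash-based host identifier scheme. This is a
   hedged illustration: `MooseForwarder` and `hashed_host_id` are
   hypothetical names, the fixed three-byte split follows this
   document's examples, and truncated SHA-256 is an assumption since
   no hash function is specified here. Hash collisions between hosts
   are not handled.

```python
import hashlib
from typing import Dict, Optional

SWITCH_ID_LEN = 3  # fixed split assumed from this document's examples


def hashed_host_id(real_mac: bytes) -> bytes:
    """Deterministic 3-byte host identifier (hash choice is an assumption)."""
    return hashlib.sha256(real_mac).digest()[:3]


class MooseForwarder:
    """Hypothetical forwarding table keyed by switch identifier."""

    def __init__(self, switch_id: bytes,
                 switch_routes: Dict[bytes, int],
                 local_hosts: Dict[bytes, int]) -> None:
        self.switch_id = switch_id
        self.switch_routes = switch_routes  # remote switch id -> output port
        self.local_hosts = local_hosts      # local host id -> output port

    def output_port(self, dst: bytes) -> Optional[int]:
        """Return the output port for a MOOSE destination, or None if unknown."""
        switch_part = dst[:SWITCH_ID_LEN]
        host_part = dst[SWITCH_ID_LEN:]
        if switch_part == self.switch_id:
            # Only the home switch ever inspects the host identifier.
            return self.local_hosts.get(host_part)
        return self.switch_routes.get(switch_part)
```

   The table sizes depend only on the number of switches and of
   directly-connected hosts, which is what keeps them predictable.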
   As the switch identifier is the only address consulted for
   forwarding decisions, a MOOSE switch is likely to remain reliable in
   the face of attacks that could have brought down a traditional
   Ethernet. Furthermore, any attacks based upon MAC address spoofing
   cannot function on a MOOSE network as the user-provided MAC address
   is translated immediately.

4.1.  Shortest Path Routing

   As described so far, MOOSE switches must still forward frames along a
   spanning tree. As discussed above, this is an undesirable property
   of Ethernet as it can cause frames to take a highly suboptimal path
   through the network. The foundations are in place to do much better
   than this using shortest-path routing.

   For the purpose of frame forwarding, a MOOSE switch can be considered
   akin to a layer 3 router; it has one locally-connected subnet--
   containing all addresses starting with its switch identifier--and
   delivers frames to other subnets by passing them to an appropriate
   neighbouring switch. Bearing this in mind, the switch can run a
   routing protocol of the kind normally used for IP, such as a variant
   of OSPF [RFC2328]. This allows frames to be routed along the
   shortest available path, rather than being constrained to a spanning
   tree. A multipath variant such as OSPF-OMP may be particularly
   desirable due to its ability to make use of multiple equal-cost
   routing paths in order to improve performance.

4.2.  Address Selection and Conflict Resolution

   For reasons akin to those of the flaws of Ethernet, it is undesirable
   to guarantee universally unique pre-determined MOOSE switch
   identifiers. Due to the reduced size of the switch ID space compared
   to the MAC address space, this would also be infeasible. We
   therefore propose that each switch selects an initial address for
   itself during startup.
   This could result in more than one switch claiming an address, which
   would be undesirable, so to mitigate the potential for MOOSE
   addresses to find themselves in conflict we additionally propose a
   simple and inexpensive conflict resolution protocol.

   Suppose two switches each have the same identifier. We note that if
   these switches are on separate MOOSE networks (on disconnected
   networks, or separated by an IP router), this situation brings no
   issue. Should they be on the same MOOSE network, however, a conflict
   exists and must be resolved. Any routing protocol would require a
   switch to know which port other switches are connected to, for
   instance by OSPF neighbour lists, or simply by receiving frames and
   noting the switch port and source MOOSE address. When a switch
   receives a MOOSE frame, it looks up the source switch in its
   forwarding database, which is likely in fast Content Addressable
   Memory. If it finds that source switch to be on a port other than
   that which it recognises from its table, one of three situations may
   be possible:

   1.  the source switch may be the same as the known switch, and have
       physically moved, or a topology change has occurred;

   2.  the source switch may be a different one to the known switch, and
       they are in conflict;

   3.  the source switch may be the same as the known switch, but is
       sending frames down a different route to the last used route.

   To avoid disruption to the network in the first case, and to give
   scope for switches to migrate within the network, the switch which
   detected the possible conflict should ascertain whether the known
   switch is still alive and present. The conflict-resolving switch
   thus sends a unicast frame to the known switch, via the port stored
   in the forwarding database, asking whether it is still there, and
   repeats this at a regular interval until a timeout.
   This will reach the known switch rather than the new switch if it is
   still present as other switches beyond that port must not have
   detected the conflict yet. We leave the nature of the timeout
   unspecified; it can be implementation-specific. It may, for
   instance, be a pre-defined constant, or it may vary based on QoS
   information gathered if such capabilities are supported. When a
   MOOSE switch receives such a frame, it should promptly respond with
   an acknowledgement frame, showing that it is alive.

   If, within the timeout period, the conflict resolver finds the known
   switch not to be alive, no conflict exists, so the resolver updates
   its view of the network by removing the old entry from its
   forwarding database and triggering a routing protocol refresh.

   If, on the other hand, the known switch is found to be alive, a
   conflict exists. The conflict resolver then sends a frame to the
   more recently found switch indicating that it is in conflict and
   should change its address. That switch, upon receiving this frame,
   changes its address and sends a gratuitous ARP for each of its
   connected hosts, so that the rest of the network is aware of the
   change. To mitigate the risks of a denial of service attack, or
   faulty equipment sending out conflict frames, an exponential backoff
   algorithm should be used when receiving conflict notification
   frames.

   A switch should have a timer and a counter influencing the maximum
   value of the timer, both initialised to 0. When a conflict
   notification frame is received, the counter is incremented (subject
   to a saturation value to avoid excessive timeouts). After a conflict
   has been resolved--i.e. the switch has changed its address--a timer
   starts counting down from some time exponential in that counter;
   subsequently the switch will only change its address if the timer has
   returned to 0 by the time the conflict frame is received.
   The counter should be reset to 0 when the timer reaches 0. Using
   this scheme the event of true conflict is handled quickly, even in
   the unlikely case that the newly acquired address is also in
   conflict. Any node emitting malicious or erroneous conflict
   notifications, however, is rate-limited enough that its damage
   potential is much restricted, subject to a sufficient timer being
   chosen.

   Pseudocode: Conflict resolution backoff:

      # On receipt of a conflict notification frame:
      if timer > 0:
        if counter < counter_max:
          counter = counter + 1
        # Discard conflict notification frame
      else:
        timer = k^counter
        change_address()

   Pseudocode: Conflict resolution timer:

      foreach clock tick do:
        if timer > 0:
          timer = timer - 1
        else:
          counter = 0

   This could be further enhanced by detecting repeated conflicts
   involving the same switch or switches, in a manner similar to BGP
   Route Flap Damping [RFC2439], and performing more aggressive steps to
   avoid further conflicts--for example using a significantly increased
   timeout, and/or having *both* switches in conflict select new
   addresses.

   The conflict resolution algorithm brings a marked improvement on the
   equivalent vulnerability of Ethernet, that MAC addresses can be
   spoofed. We build in a flexible, well-defined system of recovery.
   The decentralised nature of the system makes it much less open to
   denial of service attack than any centralised directory may be.
   Having every MOOSE switch acting as a barrier to the propagation of
   packets from addresses in conflict provides a strong separation
   between recently bridged networks with conflicting addresses, so that
   communication within the individual networks may continue without
   modification, until bridge-crossing traffic appears, at which point
   resolution quickly happens.
   We also remove the possibility of forwarding databases frequently
   having to switch their entry for a conflicted address, which can
   happen with MAC conflicts in traditional Ethernet. Additionally, in
   the case of a switch identifier spoofing attack, the conflict
   resolver acts as a hard boundary for the effects of such an attack.

   It is possible that the switch performing conflict resolution could
   send a suggested replacement switch address to the switch in
   conflict, known by the conflict resolver to have a low probability of
   being present on the network (because it is not present in its
   forwarding database). This would reduce the chance of repeated
   collisions, and potentially allow for longer backoff periods, but may
   be premature optimisation.

   Because multi-path routing is often desirable, we could introduce an
   extra datum during the source address rewriting performed by MOOSE
   switches. When an ingress MOOSE switch rewrites the source address
   of an Ethernet frame to a MOOSE address, it could also prepend some
   hash of its manufacturer-assigned MAC address to the data field, and
   increment the length field as necessary. The egress switch, when
   rewriting the MOOSE destination address to a host's MAC address, then
   strips out this added datum. This allows the conflict resolver to
   check whether conflicts actually exist by local lookup, rather than
   probing other switches, at the cost of added memory requirements in
   every switch. This may push the frame to be larger than Ethernet's
   maximum, so may require fragmenting the packet into two, at small
   added cost. Alternatively, assuming jumbo frames are permitted by
   the hardware, the maximum frame size could be marginally reduced to
   allow for this in the same manner as for 802.1Q VLAN tags.

   From the cheapness of conflict resolution, certain other address
   management tasks become simple.
   A switch is free to choose its address when it joins the network
   however it wishes--by attempting to re-use its last-used address, by
   picking from a list of preferred addresses, or by generating an
   address entirely at random. More intricate addressing schemes may be
   used on managed networks if desired, perhaps encapsulating deeper
   layers of hierarchy.

4.3.  Broadcast and Multicast

   Since Ethernet does still need to support arbitrary broadcast frames,
   these must still be forwarded along a spanning tree in order that
   they reach each host exactly once. An explicit spanning tree
   protocol is not required however, as the tree can be deduced from the
   routing table via reverse path forwarding in a similar manner to
   Protocol-Independent Multicast (PIM) [RFC3973]. In other words,
   broadcast packets are routed as if they had been sent to the
   all-hosts multicast group.

   More general multicast groups can be implemented using a combination
   of IGMP snooping [RFC4541] as used by modern Ethernet switches, and
   participation of the MOOSE switches in PIM routing.

4.4.  Example

   To illustrate the basic behaviour of MOOSE switches, before we go on
   to describe further features, we will offer a simple example. We
   will describe the steps involved in forwarding a broadcast frame
   containing a query in some higher-layer IPv4-based protocol, and
   subsequent unicast frame containing the response, between two hosts A
   and B via three MOOSE switches 02:11:11, 02:22:22 and 02:33:33.

4.4.1.  Query

   1.  Host A transmits the broadcast query frame as it would on any
       Ethernet network, with its own manufacturer-assigned MAC address
       in the Ethernet header's source field and the broadcast address
       (FF:FF:FF:FF:FF:FF) in the destination field.

   2.  The frame is received by switch 02:11:11, which observes the
       non-MOOSE address in the frame's source field, and rewrites the
       source field into a MOOSE address containing the switch
       identifier and the appropriate host identifier. As this is Host
       A's first frame, the switch must allocate a host identifier (in
       this case 00:00:01, making Host A's complete MOOSE address
       02:11:11:00:00:01).

   3.  The three switches broadcast the frame using reverse path
       forwarding away from Host A.

   4.  The frame is received by Host B (and any other hosts on the
       network) in its current form; no further rewriting is performed.

4.4.2.  Response

   1.  Host B looks up Host A's IP address in its ARP cache to determine
       a suitable destination address for the response frame. Since the
       rewritten query frame arrived at Host B with the source field
       containing the MOOSE address 02:11:11:00:00:01, this is the
       address returned by the cache lookup.

   2.  As above, switch 02:33:33 assigns a MOOSE address to Host B
       (02:33:33:00:00:01) and rewrites the source address of the frame.

   3.  The frame is now routed through the network based solely on the
       destination switch identifier--the host identifier is ignored for
       now. The routing table is consulted for the location of switch
       02:11:11 and the frame is forwarded accordingly.

   4.  On receiving the frame, switch 02:11:11 observes that it is
       destined for a directly-connected host (02:11:11:00:00:01). It
       prepares the frame for transmission along its final hop by
       rewriting the destination address to Host A's
       manufacturer-assigned MAC address. The source field of the frame
       is again left as the MOOSE address of Host B in order that this
       address is used for any further communication with Host B.

4.5.  Directory Service

   A directory service, Enhanced Lookup (ELK), runs in conjunction with
   the basic MOOSE switch described so far.
   ELK exists to handle ARP and DHCP queries in a broadcast-free manner
   by learning mappings from IP addresses to MOOSE addresses. The
   master ELK directory is served by one or multiple systems for
   resilience and is reached using an anycast MOOSE address; the layer-2
   anycast feature is a convenient side-effect of running a routing
   protocol. Slave copies of the directory can be held nearer the edge
   of the network in order to take load away from the masters; slaves
   can be reached for lookups via a separate anycast address, and the
   entire herd of ELK can be kept synchronised via the masters using a
   combination of multicast and unicast.

   MOOSE switches intercept ARP and DHCP packets broadcast by hosts and
   convert them into anycast ELK queries to the nearest slave (for ARP)
   or master (for DHCP). (DHCP handling could make use of the
   protocol's existing DHCP relay mechanism.) The ELK slave answers ARP
   queries directly using information in the directory; as it does so,
   if the query is from a host not in the directory, it learns the
   sender's IP address to MOOSE address mapping. The ELK master can
   also act as a DHCP server, populating the ELK directory as it grants
   IP address leases to clients.

   The one case in which the ELK directory will not contain the answer
   to a query is when answering an ARP request for a host that is not
   configured to use DHCP and that has not yet itself sent an ARP packet
   (i.e. has not yet communicated via IP). This must be dealt with by
   flooding the query to every active switch port, in a manner akin to
   current Ethernet switches, and caching the result in the ELK
   directory. Although this is not ideal, it is necessary in order to
   deal with this scenario in a compatible manner, and is unlikely to
   happen frequently.

4.6.  Mobility

   A consequence of introducing location-based hierarchy into MAC
   addresses is the need to explicitly handle host mobility. In a
   traditional Ethernet, hosts can migrate between switches as the
   switches will learn the host's new location as soon as it sends a
   frame. With MOOSE, if a host relocates to a new switch its address
   changes and any ARP cache entries on other hosts pertaining to the
   migrated host become incorrect; frames will continue to be sent to
   the host's old location for a while. There are two strategies for
   dealing with this, which can be used separately or in conjunction:

   1.  The previous home switch of the migrated host can forward frames
       sent to the host's old address until outdated ARP cache entries
       expire. This is similar to IP Mobility: the previous home switch
       essentially becomes a care-of agent for the host. However,
       unlike IP Mobility, it requires no host support. A handover
       protocol is necessary for the old and new home switches to set up
       such forwarding: on the arrival of a new host at a switch, that
       switch would ask all other switches (via multicast) whether any
       had seen this host before, identifying it using its
       manufacturer-assigned MAC address, and would instruct such
       switches to redirect frames.

   2.  A broadcast ARP announcement (or "gratuitous ARP") can be sent by
       the new home switch to immediately update remote ARP caches and
       the ELK directory with the new MOOSE address. This is the
       technique used by Xen when migrating live virtual machines.
       Unlike the previous approach, this works even if the previous
       switch is no longer reachable, for example if this host migration
       was as a result of a switch failure. This is a simpler approach
       as a handover protocol is not required, but results in additional
       broadcast traffic.
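
   As a rough illustration of the first strategy, the Python sketch
   below shows the redirect state that a previous home switch might
   keep once the handover protocol has told it where a host has gone.
   The class and method names, the string form of the addresses, the
   host identifier on the new switch and the 300-second default
   lifetime are all assumptions made for this example; the handover
   messages themselves are not modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class CareOfTable:
    """Redirects kept by the previous home switch of migrated hosts."""

    # Assumed lifetime of a redirect, standing in for the time remote
    # ARP cache entries take to expire; the text gives no figure.
    ttl_seconds: float = 300.0
    # Old MOOSE address -> (new MOOSE address, expiry time).
    entries: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def redirect(self, old_addr: str, new_addr: str, now: float) -> None:
        """Record that frames for old_addr should go to new_addr."""
        self.entries[old_addr] = (new_addr, now + self.ttl_seconds)

    def resolve(self, dest_addr: str, now: float) -> Optional[str]:
        """Return the forwarding address, or None if no live redirect."""
        entry = self.entries.get(dest_addr)
        if entry is None:
            return None
        new_addr, expires_at = entry
        if now >= expires_at:
            # Remote caches are assumed to have expired; stop forwarding.
            del self.entries[dest_addr]
            return None
        return new_addr


table = CareOfTable(ttl_seconds=10.0)
# Host A moves from switch 02:11:11 to switch 02:22:22 (host ID assumed).
table.redirect("02:11:11:00:00:01", "02:22:22:00:00:07", now=0.0)
```

   A frame addressed to 02:11:11:00:00:01 within the lifetime would be
   rewritten to the new address and forwarded; after expiry the entry
   is dropped, on the assumption that remote ARP caches have by then
   been refreshed.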

   Unless the frequency of host migrations is very high, the additional
   load introduced by either mobility approach is expected to be
   negligible.

   Illustration of the two ways to handle a host A roaming onto another
   switch whilst maintaining communication with another host B:

     (1)              +--------+
     ##============== | Host B |  <=== ARP ===##  (2) gratuitous
     ||               +--------+              ||      ARP sent by
     ||                   |                   ||      new home switch
     ||                 +---+                 ||
     ||    .------------| X |------------.    ||
     ||   /             +---+             \   ||
     \/  |                                 |  ||
     +---+       (1) data forwarded        +---+
     | X | ==========================>     | X |
     +---+       by care-of switch       ||+---+
       |                                 \/  |
   + - - - +                            +--------+
   |       |- - host relocated to - >   | Host A |
   + - - - +       new switch           +--------+

5.  Interoperability Considerations

5.1.  Layer-violating Protocols

   In an ideal world, free from layering violations, all layer 3
   protocols would operate correctly on top of MOOSE in exactly the same
   way that they currently operate on top of Ethernet, with no
   protocol-specific handling necessary in the switch. In reality,
   however, protocols abound which use hosts' MAC addresses for purposes
   other than layer 2 addressing or which place MAC addresses in the
   frame payload. DHCP and ARP have already been mentioned as such
   protocols which must be specifically handled by edge switches in
   order to operate; luckily, the rewriting required for these important
   protocols is simple.

   Of particular concern are recent standards for layering on top of
   Ethernet protocols which were previously used solely on dedicated
   hardware interconnects, such as Fibre Channel over Ethernet (FCoE
   [FC-BB-5]). In order to support FCoE and similar protocols on a
   MOOSE network, each edge switch will need to be able to interpret and
   rewrite individual protocols that are in use.
   A production MOOSE switch would, therefore, need to be implemented
   such that it is possible to add rewriting support for additional
   protocols after manufacture, for example by loading an additional
   software or FPGA configuration module.

   Ultimately, in the general case, this problem could be addressed more
   satisfactorily by extending the Ethernet standard to provide a
   protocol-agnostic method for a layer 2 network to inform hosts of
   their own addresses; LLDP [802.1AB] would make a good basis for this
   extension. This would allow the use of network-assigned MAC
   addresses for any protocol, with some rewriting performed either
   partially (within the frame payload) or fully by the host itself, and
   furthermore would allow higher-layer protocols to respond to changes
   of the host's network-assigned address (e.g. due to mobility). Such
   a mechanism could be deployed incrementally as needed, with switches
   able to perform address rewriting for hosts which are not able to do
   this themselves. This is, however, a very long-term solution, and
   protocol-specific rewriting on the switch is likely to be required
   for the foreseeable future.

   FCoE in particular is unusual, however, as it already does its own
   dynamic allocation of MAC addresses to devices. It is conceivable
   that an extension to FCoE could be developed which allows a
   network-wide dynamic address assignment scheme such as MOOSE to be
   exploited to provide addresses directly to fibre channel devices.

5.2.  Edge Virtual Bridging

   The rise of virtualisation has caused an unanticipated proliferation
   of software switches, usually in the host operating system or
   hypervisor which provides network connectivity to multiple virtual
   machines.
   Since software switches are almost always neither fast nor centrally
   manageable in the same way as hardware switches, there is ongoing
   work to standardise--by Cisco as Port Extension and by the IEEE as
   Edge Virtual Bridging [P802.1Qbg]--a means of making these software
   switches act merely as additional ports which are logically part of a
   more central hardware switch. This reduces the work required by a
   virtual edge switch: frames from local virtual edge ports can be
   forwarded straight out via the uplink to a physical switch without
   consideration, and frames from the uplink will arrive simply tagged
   with a virtual edge port identifier.

   (The scope of Port Extension in particular is greater than this, and
   allows for physical port extenders to exist in place of switches
   where a large number of ports but a small amount of processing is
   required, but virtualisation is likely to be the most significant use
   case.)

   Edge Virtual Bridging and Port Extension require very little
   adaptation to be implemented on a MOOSE switch. It is unlikely,
   although too early in the standardisation process to say for certain,
   that the virtual bridge will need to be MOOSE-aware. A
   virtual-bridging-aware physical MOOSE switch will thus simply need to
   take into account the possibility that one physical port may hide a
   large number of virtual ports when allocating host identifiers, as it
   would if it had an Ethernet switch connected on that port. If,
   however, the virtual bridge is made MOOSE-aware, the hierarchical
   addressing of MOOSE could be exploited to allow the virtual bridge to
   allocate host identifiers itself, given that it is likely to be aware
   of the exact number and nature of virtual edge ports.
   The parent MOOSE switch would accordingly allocate an address prefix
   to each child virtual bridge, and hosts' full MOOSE addresses could
   be formed as:

      SWITCH ID  :   CHILD ID   :   HOST ID
      (parent)      (allocated     (allocated
                     by parent)     by child)

6.  Prototype Implementation

   We have implemented a MOOSE switch in OpenFlow and NOX, which can be
   run on off-the-shelf switches. Details can be found in our paper
   [Wa10].

7.  Conclusions

   Ethernet remains popular due to its simplicity and ubiquity, but is
   showing its age and exhibits serious scalability issues in large
   deployments. Previously-proposed improvements address either a few
   of the problems in a simple way, or most of the problems in a highly
   complex or backwards-incompatible way. We have demonstrated a
   simple, novel and easily-implementable approach for significantly
   boosting the scalability of Ethernet, which has a working prototype
   switch firmware implementation.

8.  IANA Considerations

   This memo includes no request to IANA.

9.  Security Considerations

   Security will be considered in a later revision of this document.

10.  Informative References

   [802.1AB]  IEEE, "802.1AB: Station and Media Access Control
              Connectivity Discovery", 2009.

   [802.1D]   IEEE, "802.1D: Standard for Local and Metropolitan Area
              Networks: Media Access Control (MAC)", 2004.

   [802.1X]   IEEE, "802.1X: Port Based Network Access Control", 2004.

   [Cl05]     Clark, C. and others, "Live Migration of Virtual
              Machines", USENIX NSDI 2005, 2005.

   [FC-BB-5]  T11 FC-BB-5 working group, "Fibre Channel Backbone - 5",
              June 2009.

   [Ha01]     Hadzic, I., "Hierarchical MAC Address Space in Public
              Ethernet Networks", IEEE GLOBECOM vol 3, 2001, 2001.

   [Ki08]     Kim, C., Caesar, M., and J. Rexford, "Floodless in
              SEATTLE: A Scalable Ethernet Architecture for Large
              Enterprises", ACM SIGCOMM 2008, 2008.

   [My04]     Myers, A., Ng, E., and H. Zhang, "Rethinking the Service
              Model: Scaling Ethernet to a Million Nodes", ACM SIGCOMM
              Workshop on Hot Topics in Networking 2004, November 2004.

   [P802.1Qbg]
              Jeffree, A., Congdon, P., and J. Pelissier, "P802.1Qbg:
              Edge Virtual Bridging", September 2009.

   [Pe04]     Perlman, R., "RBridges: Transparent Routing", Proc.
              INFOCOM vol 2, 2005, March 2004.

   [RFC0826]  Plummer, D., "Ethernet Address Resolution Protocol: Or
              converting network protocol addresses to 48.bit Ethernet
              address for transmission on Ethernet hardware", STD 37,
              RFC 826, November 1982.

   [RFC2131]  Droms, R., "Dynamic Host Configuration Protocol",
              RFC 2131, March 1997.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

   [RFC2439]  Villamizar, C., Chandra, R., and R. Govindan, "BGP Route
              Flap Damping", RFC 2439, November 1998.

   [RFC3031]  Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol
              Label Switching Architecture", RFC 3031, January 2001.

   [RFC3344]  Perkins, C., "IP Mobility Support for IPv4", RFC 3344,
              August 2002.

   [RFC3973]  Adams, A., Nicholas, J., and W. Siadak, "Protocol
              Independent Multicast - Dense Mode (PIM-DM): Protocol
              Specification (Revised)", RFC 3973, January 2005.

   [RFC4541]  Christensen, M., Kimball, K., and F. Solensky,
              "Considerations for Internet Group Management Protocol
              (IGMP) and Multicast Listener Discovery (MLD) Snooping
              Switches", RFC 4541, May 2006.

   [RFC5556]  Touch, J. and R. Perlman, "Transparent Interconnection of
              Lots of Links (TRILL): Problem and Applicability
              Statement", RFC 5556, May 2009.

   [Ro00]     Rodeheffer, T., Thekkath, C., and D. Anderson,
              "SmartBridge: A Scalable Bridge Architecture", ACM
              SIGCOMM 2000, 2000.

   [Wa10]     Wagner-Hall, D., "A Prototype Implementation of MOOSE on a
              NetFPGA/OpenFlow/NOX Stack", First European NetFPGA
              Developers' Workshop Cambridge, September 2010.

Authors' Addresses

   Malcolm Scott (editor)
   University of Cambridge
   15 JJ Thomson Ave
   Cambridge, CB3 0FD
   UK

   Phone: +44 1223 763500
   Fax:   +44 1223 334678
   Email: Malcolm.Scott@cl.cam.ac.uk
   URI:   http://www.cl.cam.ac.uk/~mas90/MOOSE/

   Daniel Wagner-Hall
   University of Cambridge

   Email: dwh@cantab.net

   Jon Crowcroft
   University of Cambridge
   15 JJ Thomson Ave
   Cambridge, CB3 0FD
   UK

   Phone: +44 1223 763500
   Fax:   +44 1223 334678
   Email: Jon.Crowcroft@cl.cam.ac.uk