Network Working Group                                        P. Lapukhov
Internet-Draft                                                  Facebook
Intended status: Informational                          October 31, 2016
Expires: May 4, 2017

  Deploying Identifier-Locator Addressing (ILA) in datacenter networks
                    draft-lapukhov-ila-deployment-01

Abstract

   The Identifier-Locator Addressing architecture defined in
   [I-D.herbert-nvo3-ila] proposes the use of a locator-identifier
   split in the IPv6 address to realize workload mobility and more
   efficient use of network resources. This document describes how ILA
   can be implemented in a datacenter using BGP as the control-plane
   protocol. Generally speaking, ILA could be built using different
   control planes, and BGP is one particular instantiation. The
   motivation is that BGP is a well-known protocol, sufficient for
   small to medium size deployments, on the scale of a few million
   identifier-to-locator mappings. Defining more generic and scalable
   control plane variants is outside the scope of this document.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 4, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  ILA deployment process
   4.  Preparing the network
     4.1.  Data-center network topology
     4.2.  Configuring locator addressing
   5.  Deploying ILA routers
     5.1.  ILA Redirect Message
     5.2.  Configuration parameters
     5.3.  ILA router operation
     5.4.  Scaling considerations
   6.  Deploying ILA hosts
     6.1.  Configuration parameters
     6.2.  Providing task isolation
     6.3.  ILA host operation
   7.  Using BGP as the ILA control plane
     7.1.  BGP topology
     7.2.  Any-to-any mapping distribution
     7.3.  Hub-and-spoke mapping distribution
   8.  Push vs pull mapping distribution modes
   9.  ILA address management
     9.1.  Decentralized address management
     9.2.  Centralized address management
     9.3.  Role of Task scheduler
   10. ILA domain federation
   11. Operational Considerations
     11.1.  Operational procedures for ILA routers
     11.2.  ICMPv6 Message generation by transit devices
     11.3.  Multicast routing
     11.4.  Potential ILA mapping table complications
     11.5.  Potential ILA routers complications
   12. Deployment Scenario Primer
   13. IANA Considerations
   14. Manageability Considerations
   15. Security Considerations
     15.1.  ILA host security
     15.2.  BGP Security
     15.3.  ILA router security
     15.4.  Tenant security
   16. Acknowledgements
   17. Informative References
   Author's Address

1.  Introduction

   This document provides high-level guidelines for building an ILA-
   enabled datacenter using BGP [RFC4271] as the protocol for ILA
   mapping information dissemination. The reader is expected to be
   familiar with the principles presented in [I-D.herbert-nvo3-ila].
   Familiarity with the ILNP architecture defined in [RFC6740] is also
   recommended, but is not required to understand this document.
   While ILA does not implement the original ILNP proposal, it is based
   on the same idea of maintaining the Identifier vs Locator split in
   the IPv6 address.

   ILA benefits from routed datacenter networks, i.e.
   networks that do
   not rely on spanning Layer-2 domains across multiple network devices.
   Endpoint mobility is one of the key benefits ILA brings to datacenter
   networks. Combining ILA with a fully routed network design achieves
   the robustness of a routed network together with the flexibility of
   endpoint mobility. Some practical recommendations for building a
   fully routed datacenter network can be found in [RFC7938] or
   [ROUTED-DESIGN].

   Though workload mobility could also be achieved in L3 switched
   networks by using the "host-route injection" technique, such an
   approach has limited applicability, due to the high stress it puts on
   the underlying control and data planes. The mobile prefix needs to be
   removed, re-injected and propagated to all network devices every time
   an address moves.

   ILA is an alternative to "encapsulation" approaches, such as LISP
   ([RFC6830]), for realizing endpoint mobility and network
   virtualization. Using simple address rewrites significantly reduces
   the processing overhead on the hosts, and makes various hardware and
   software network acceleration functions easier to implement (e.g.
   checksum computation offload). Furthermore, ILA keeps the underlying
   network fully visible to the applications that use ILA addresses,
   which makes network troubleshooting easier, as compared to the
   "encapsulation" approaches.

2.  Terminology

   This section defines ILA-specific terminology that is used
   throughout the document.

   ILA domain:  a collection of ILA hosts and ILA routers that
      collectively support the ILA identifier mobility and network
      virtualization model. The ILA domain is assigned a single 64-bit
      IPv6 prefix known as the SIR (Standard Identifier Representation,
      see [I-D.herbert-nvo3-ila]) prefix, which is made known to all
      hosts and routers in the domain.
      This prefix is used to construct the
      complete 128-bit IPv6 addresses for ILA identifiers found in the
      domain.

   SIR Address:  IPv6 address constructed from the SIR prefix
      concatenated with the 64-bit identifier. This is the address
      visible to the applications and transport layer on ILA hosts.

   ILA Address:  IPv6 address constructed from an actual valid 64-bit
      locator and a 64-bit identifier. This is the address seen by
      transit network devices; it is expected to be routable in the
      underlying network.

   ILA mapping table:  The table for mapping identifiers to locators
      present in an ILA host or ILA router. This table is updated either
      via BGP or via ILA redirect messages. ILA routers maintain a full
      authoritative copy of the table, while ILA hosts may have their
      own smaller view of the global mapping state.

   ILA host:  network endpoint that is capable of accepting and
      originating packets with ILA addresses, by performing a stateless
      rewrite between SIR addresses and ILA addresses. The host
      maintains its own local version of the ILA mapping table and has
      at least one ILA locator (64-bit prefix) assigned.

   Non-ILA host:  network endpoint that is not aware of the ILA
      addressing structure and does not participate in ILA address
      translations. To this host, the SIR and ILA addresses look like
      regular IPv6 addresses.

   ILA router:  network endpoint that is responsible for two main
      functions:

      *  Storing and disseminating the authoritative ILA mapping
         information within the ILA domain (NVA role per
         [I-D.ietf-nvo3-arch]).

      *  Serving as the gateway between the ILA hosts and non-ILA hosts,
         as well as the gateway for communicating with other ILA domains
         (NVE role per [I-D.ietf-nvo3-arch]).

   Task:  the unit of mobility in the ILA domain. Each task is assigned
      an identifier unique within the ILA domain, which follows the task
      as it changes the hosts and, consequently, the locators.
      Implementation-wise, the task can run within a container or a
      virtual machine, for example.

   Tenant:  owner of the tasks executed in the shared environment.

   Common Locator Address (CLA):  Special ILA address constructed from
      the host's locator and the identifier ::1, identifying the
      physical host itself. This address is used to send and receive ILA
      redirect messages.

3.  ILA deployment process

   The ILA domain consists of the following conceptual elements:

   o  Routed network that provides reachability among physical hosts,
      i.e. provides routing within the locator address space.

   o  ILA hosts, each assigned a unique /64 prefix reachable within the
      network. ILA hosts maintain their own local version of the ILA
      mapping table.

   o  ILA routers, each injecting the domain's SIR prefix into the
      routed network and maintaining the full mapping table for the ILA
      domain. The routers could be implemented in software, or using
      specialized hardware appliances.

   o  Centralized BGP route-reflector nodes that peer with all of the
      ILA hosts and all of the ILA routers within the domain for the
      purpose of mapping information dissemination. ILA hosts and
      routers run BGP processes to communicate with the reflectors.

   Deploying ILA in a datacenter requires the following logical steps:

   o  Preparing the network. Assigning locator addressing to the hosts
      (servers) in the network and providing routed interconnection
      among the locator prefixes.

   o  Configuring ILA hosts and ILA routers. Each ILA domain requires a
      set of ILA routers to facilitate the mapping function and provide
      connectivity to other ILA domains and the Internet. Each ILA
      domain is assigned a /64 SIR prefix, which scopes all identifiers
      in the domain. All ILA hosts and ILA routers within a domain are
      aware of the SIR prefix of this domain.

   o  Enabling the ILA control plane.
      Configuring the BGP mesh for
      mapping information dissemination within the ILA domain and
      injecting the SIR prefix into the routed network from the ILA
      routers to facilitate communications among ILA domains and from /
      to the Internet. See [I-D.lapukhov-bgp-ila-afi] for the definition
      of the corresponding BGP extension.

   o  Deploying an address management solution to coordinate allocation
      of ILA identifiers. In the simplest cases, the addresses could be
      generated on each host individually, without central coordination.

4.  Preparing the network

   This section provides an overview of the network-related
   configuration needed for ILA.

4.1.  Data-center network topology

   For ease of reference, this document adopts the Clos topology
   described in [RFC7938] along with the terminology developed in that
   document.

                                      Tier-1
                                     +-----+
          Cluster                    |     |
 +----------------------------+   +--|     |--+
 |                            |   |  +-----+  |
 |                    Tier-2  |   |           |   Tier-2
 |                   +-----+  |   |  +-----+  |  +-----+
 |     +-------------| DEV |------+--|     |--+--|     |-------------+
 |     |       +-----|  C  |------+  |     |  +--|     |-----+       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     |       |              |                              |       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     | +-----------| DEV |------+  |     |  +--|     |-----------+ |
 |     | |     | +---|  D  |------+--|     |--+--|     |---+ |     | |
 |     | |     | |   +-----+  |   |  +-----+  |  +-----+   | |     | |
 |     | |     | |            |   |           |            | |     | |
 |   +-----+ +-----+          |   |  +-----+  |          +-----+ +-----+
 |   | DEV | | DEV |          |   +--|     |--+          |     | |     |
 |   |  A  | |  B  | Tier-3   |      |     |      Tier-3 |     | |     |
 |   +-----+ +-----+          |      +-----+             +-----+ +-----+
 |     | |     | |            |                            | |     | |
 |     O O     O O            |                            O O     O O
 |       Servers              |                              Servers
 +----------------------------+

                    Figure 1: 5-Stage Clos topology

   The network is partitioned hierarchically in three tiers, with tier
   numbering starting at the "middle" stage of the Clos network. The
   "middle" tier is often called the "spine" of the network.

   A set of directly connected Tier-2 and Tier-3 devices along with
   their attached servers will be referred to as a "cluster".

   Tier-3 switches that connect the servers are often referred to as
   "ToR" (Top of Rack) switches or simply "rack switches".

4.2.  Configuring locator addressing

   A mandatory prerequisite for ILA deployment is enabling IPv6 routing
   in the network. This could be done using either a dual-stack
   IPv4/IPv6 deployment or an IPv6-only deployment. This document
   assumes the network has already been configured to forward IPv6
   traffic. See [I-D.ietf-v6ops-dc-ipv6] for operational considerations
   on deploying IPv6 in the datacenter.

   ILA requires every ILA host to have at least one 64-bit locator
   assigned. This means that every host (server) in the datacenter
   network needs to have at least one /64 IPv6 prefix configured on one
   of its interfaces. These /64 prefixes could be either globally
   routable or unique-local.

   The use of globally routable addressing allows for deploying a highly
   scalable hierarchical addressing scheme, and makes the locators
   accessible from the Internet. The figure below illustrates the
   structure of a globally routable locator:

   |<------------------ Locator -------------------->|
   |3 bits|   N bits   | M1 bits | M2 bits | M3 bits |      64 bits
   +------+------------+---------+---------+---------+-------------------+
   | 001  | Global pfx | Cluster |  Rack   |  Host   |    Identifier     |
   +------+------------+---------+---------+---------+-------------------+
   |<-------------------- 64-bits ------------------>|

   For example, a global /32 prefix (N=29) allows for sub-allocation of
   2^32 locators. This sub-allocation could be done hierarchically,
   mapping to the tiers of network topology.
   Following the /32 example
   prefix:

      Allocate 256 /64 prefixes per Tier-3 switch (M3 = 8 bits), which
      allows for up to 256 physical hosts in a rack, with a /56 prefix
      assigned per rack.

      Assuming 256 Tier-3 switches per cluster, one would allocate a /48
      per cluster (M2 = 8 bits).

      This leaves room for 16 bits (64K) of clusters per datacenter
      (M1 = 16 bits). This space could be further sub-divided if
      multiple Clos network fabrics have been deployed.

   The use of unique-local addressing for locators is more limiting in
   terms of available space, as it only offers 16 bits for sub-
   allocation. It does, however, have the benefit of ad-hoc allocation.
   This could work better for smaller deployments, e.g. allocating
   10 bits to enumerate Tier-3 switches (physical racks of servers) and
   6 bits to enumerate hosts within a rack. For instance, the address
   structure may look as follows, with M1 = 10 bits and M2 = 6 bits.

   |<----------------- Locator --------------->|
   | 7 bits |1|  40 bits   | M1 bits | M2 bits |         64 bits         |
   +--------+-+------------+---------+---------+-------------------------+
   |  FC00  |L| Global ID  |  Rack   |  Host   |       Identifier        |
   +--------+-+------------+---------+---------+-------------------------+
   |                       |<---- 16 bits ---->|
   |<--------------- 64-bits ----------------->|

   In either case, the addressing scheme is hierarchical, allowing for
   simple route summarization logic and better routing system scaling
   (see [RFC2791]). This is especially important in the case of IPv6,
   since contemporary datacenter network switches often have smaller
   IPv6 lookup tables as compared to IPv4. Route summarization also
   requires certain network design changes to avoid packet black-holing
   under link failures. This problem gets more complicated in Clos
   topologies, and is analyzed in more detail in [RFC7938].
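
   The globally routable /32 example above can be sketched in a few
   lines of Python. This is only an illustration under stated
   assumptions: the documentation prefix 2001:db8::/32 stands in for the
   operator's global prefix, and the names locator() and summaries() are
   hypothetical helpers, not part of any ILA specification.

```python
import ipaddress

# Assumed values for illustration only: the documentation prefix stands
# in for the operator's global /32, and the field widths follow the
# example in the text (M1 = 16, M2 = 8, M3 = 8 bits).
GLOBAL_PREFIX = ipaddress.IPv6Network("2001:db8::/32")
CLUSTER_BITS, RACK_BITS, HOST_BITS = 16, 8, 8


def locator(cluster: int, rack: int, host: int) -> ipaddress.IPv6Network:
    """Return the /64 locator for a host at the given topology position."""
    for value, bits in ((cluster, CLUSTER_BITS), (rack, RACK_BITS),
                        (host, HOST_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("index does not fit the allocated field")
    # Pack cluster | rack | host into the 32 bits below the global prefix.
    index = (cluster << (RACK_BITS + HOST_BITS)) | (rack << HOST_BITS) | host
    # The locator occupies the upper 64 bits, hence the shift by 64.
    base = int(GLOBAL_PREFIX.network_address) | (index << 64)
    return ipaddress.IPv6Network((base, 64))


def summaries(loc: ipaddress.IPv6Network):
    """Return the per-cluster (/48) and per-rack (/56) summary prefixes."""
    cluster_len = GLOBAL_PREFIX.prefixlen + CLUSTER_BITS
    rack_len = cluster_len + RACK_BITS
    return (loc.supernet(new_prefix=cluster_len),
            loc.supernet(new_prefix=rack_len))


if __name__ == "__main__":
    loc = locator(cluster=1, rack=2, host=3)
    print(loc)
    print(*summaries(loc))
```

   Because the fields are packed from the most significant bit down,
   the per-rack and per-cluster summaries fall out as simple supernets
   of the host locator, which is what makes static summary routes on
   the Tier-3 and Tier-2 devices straightforward to derive.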

   In greenfield deployments, each ILA host could be assigned a /64
   locator prefix during the provisioning phase. There are multiple
   options to accomplish this:

   o  Assigning static link-local addresses to servers and statically
      routing /64 prefixes from Tier-3 switches to the servers over
      those link-local addresses. In this model, the operator would
      plan and pre-allocate per-ILA-host prefixes beforehand, and
      configure the Tier-3 switches accordingly. From an operational
      risk perspective, if the server is not present while the static
      route is configured on the Tier-3 switch, packets destined to the
      corresponding /64 prefix will cause the switch to continuously
      generate IPv6 NDP packets ("gleaning"), which puts extra stress on
      the device's CPU.

   o  The servers may request the /64 prefix using the IPv6 Prefix
      Delegation mechanism as defined in [RFC3633]. This allocation
      could be made "permanent" by proper DHCPv6 server configuration
      that ensures the same prefix is always delegated to the same
      server. The Tier-3 switch would act as a DHCPv6 relay and would
      install the corresponding /64 IPv6 route dynamically. This
      approach addresses both the allocation and the routing problem,
      but makes the setup potentially more fragile operationally
      (reliance on an additional protocol) and harder to debug (an
      additional process is involved).

   o  The server may run a routing daemon (e.g. a BGP process) and
      inject the pre-allocated /64 prefix into the Tier-3 switch. The
      address allocation in this case needs to happen by some other
      means. This is more suitable for ad-hoc ILA testing and small,
      rapid deployments.

   The server itself may use one of the IPv6 addresses in the /64 prefix
   for its own addressing, e.g. for remote access or management
   purposes. Alternatively, the server may obtain another IPv6 address
   from a different (non-locator) IPv6 address range allocated for the
   datacenter.
   This document recommends using ::1 as the
   special identifier for the host itself, and calls the resulting
   address the "Common Locator Address" (CLA). Such a choice of
   identifier makes it easy to differentiate from regular identifiers.
   This identifier could be used for connectivity testing.

   Route summarization for the locator prefixes is highly desirable to
   reduce the stress on the network switches' forwarding tables and
   improve control-plane stability, and needs to be implemented at least
   on Tier-3 switches. In the simplest case, the switches could be
   statically preconfigured with the summary routes. These routes need
   to agree with the prefixes that are assigned to the servers,
   especially in the case when dynamic prefix injection is used. As a
   possible alternative, simple virtual aggregation could be employed,
   where hosts inject both the specific and the summary route, and
   installation of the corresponding FIB entries is suppressed as per
   the rules defined in [RFC6769]. The latter approach does not improve
   the control plane scalability, but solves the issues with packet
   black-holing in the presence of network summarization. It also
   requires network hardware support, which may not be present.

   In retrofitting scenarios, the servers are likely to already have
   128-bit IPv6 addresses assigned, allocated from the datacenter
   address space, e.g. by using a single /64 prefix per Tier-3 switch.
   In this case, the additional locator prefix needs to be assigned in
   the same way as described above for greenfield deployments. The only
   difference is that the new prefix and the old server address may be
   allocated from different IPv6 address ranges.

5.  Deploying ILA routers

   ILA routers perform multiple functions within the ILA domain:

   o  Serve as the centralized store of the identifier-to-locator
      mapping information in the domain. The mappings are delivered to
      the ILA routers as described in Section 7.

   o  Act as the gateway between the ILA hosts and non-ILA capable
      hosts, e.g. the Internet.

   ILA hosts initially send packets destined to identifiers they have no
   mappings for to the ILA routers, which perform the ILA translation.
   Hosts outside of the ILA domain use the ILA routers for all
   communications with the domain. The ILA routers may also act as ILA
   hosts and have one or more identifiers assigned.

5.1.  ILA Redirect Message

   ILA routers may originate and ILA hosts must receive and process ILA
   redirect messages. The ILA redirect message is carried in a UDP
   packet and destined toward a well-known port. It carries the
   information binding an identifier to its locator. For security
   purposes, this message is expected to be authenticated by
   cryptographic means, such as by using a keyed HMAC (message
   authentication code) procedure. Every host in the domain is then
   required to be configured with the key information to be able to
   validate the redirect messages.

   The ILA redirect message might be signed with multiple HMAC keys to
   facilitate key transition in the domain. The redirect message will
   carry multiple signatures along with the corresponding numeric key
   identifiers, and the ILA hosts are expected to use the signature with
   the highest locally known key identifier. As the old key leaves
   rotation, eventually every host will get updated and the signature
   made using the old key can be removed.

5.2.  Configuration parameters

   The ILA routers need the following configured for their operation:

   o  Regular, non-anycast 128-bit IPv6 address to connect the ILA
      router to the datacenter network.

   o  Cryptographic material to authenticate ILA redirect messages, for
      example the key to be used with the HMAC scheme.

   o  The /64 SIR prefix for the ILA domain, shared by all ILA routers.
      This prefix is advertised into the network in anycast fashion and
      "intercepts" all traffic destined from hosts outside of the ILA
      domain to the SIR addresses in the domain. The prefix could be
      injected in "always-on" fashion, e.g. by using BGP injectors on
      the ILA routers. This couples the ILA router's life-cycle with the
      prefix injection cycle.

   o  Control-plane configuration, i.e. the IPv6 addresses of the BGP
      route reflectors, and possibly some configuration for the local
      BGP process. This is discussed in more detail in Section 7.

   o  Management settings, such as the maximum rate of ILA redirect
      messages, and associated security attributes (e.g. the key pair
      used for message signing).

   o  A configuration flag that instructs the router whether ILA
      redirect messages need to be sent out. The ILA router does not
      receive ILA redirect messages, since by design it knows of all
      active mappings in the domain.

5.3.  ILA router operation

   Upon booting, the ILA router is first required to join the control
   plane mesh and learn of the mappings that exist in the ILA domain.
   It is also aware of the SIR prefix that is used within its domain.
   After the router has learned of the mappings, it may inject the
   anycast SIR prefix in the datacenter network and join the operational
   group of ILA routers.

   Just like any ILA node, the ILA router is required to have a 64-bit
   locator configured. The special identifier ::1 is used to build the
   source and destination addresses of the ILA redirect messages.

   When an ILA router receives a packet, it compares the upper 64 bits
   of the destination IPv6 address with its configured SIR prefix and
   performs the following:

   o  If the destination address does not match the SIR prefix, the ILA
      router discards the packet, as it is not supposed to be received
      by the ILA router.

   o  Attempts to resolve the source identifier (bottom 64 bits of the
      source address), if applicable. If the source address matches the
      SIR prefix, the packet is coming from an ILA host. The router then
      needs to translate the identifier found in the source address to
      its locator. If the translation fails, the router sends back the
      ILA "Mapping Not Found" message. If the source address does not
      match the SIR prefix, then no translation is needed, and no
      redirect messages need to be sent back.

   o  Attempts to find the locator matching the destination identifier
      (the bottom 64 bits of the destination IPv6 address). If the
      mapping for the destination identifier is not found, the original
      packet is dropped, and an ICMPv6 "Destination Unreachable"
      message with code 3 ("Address unreachable") is sent back to the
      message originator. Otherwise, the router does the following:

      *  Rewrites the SIR prefix in the destination IPv6 address with
         the new locator and forwards the packet back to the network.

      *  If sending of ILA redirect messages is permitted, the router
         sends the ILA redirect message back to the originator of the
         packet, by looking up the source identifier and finding the
         corresponding locator. The redirect informs the source of the
         actual destination locator. The redirect messages must be
         rate-limited to avoid sending an ILA redirect for every
         incoming IPv6 packet.

      *  As mentioned previously, the source and the destination ILA
         addresses of the redirect message IPv6 header use the
         identifier value "::1", which designates them to be delivered
         to the ILA control process.

   If the source IPv6 address check reveals that the packet is not
   coming from the ILA domain the router belongs to (i.e. the SIR prefix
   does not match), the ILA router does not need to send back the ILA
   redirect message, and instead simply forwards the packet if the
   locator for the destination identifier is found.
   The
   ILA router will still send the ICMPv6 "Destination Unreachable"
   message for unknown mappings.

5.4.  Scaling considerations

   Due to high load and reliability concerns, the ILA domain needs
   multiple ILA routers. The simplest way to provide redundancy is by
   letting the ILA routers inject the /64 SIR IPv6 prefix into the
   datacenter network in anycast fashion ([RFC4786]). This naturally
   allows the datacenter network's Equal-Cost Multipath (ECMP)
   capabilities to distribute traffic among the ILA routers.

   For redundancy purposes, the ILA routers would need to be spread
   across multiple physical racks in the datacenter. More ILA routers
   could be added incrementally to reduce the load and scale capacity
   horizontally. They join the operational ILA group in a non-disruptive
   fashion, after they have learned the full mapping table for the ILA
   domain.

   Use of the anycast method does have some routing implications.
   For example, using the network described in Section 4.1 will result
   in ILA hosts preferring to use the ILA routers in the same cluster,
   since those are closer based on the routing metric. Thus, the
   network may not evenly spread their packets across all ILA routers in
   the datacenter. It is therefore possible that some ILA routers will
   receive more traffic than the others. This issue applies to anycast
   routing in general, and is not specific to ILA.

6.  Deploying ILA hosts

   This section reviews the deployment considerations for the ILA hosts.

6.1.  Configuration parameters

   The ILA hosts need to be configured with the following:

   o  SIR prefix of the ILA domain.

   o  IPv6 addresses of the BGP route reflectors.

   o  The routable /64 locator assigned to the host.

   o  ILA mapping entries expiration time, to time out unused entries.

   o  Cryptographic material to allow validation of redirect messages.

   o  Boolean flag, defining whether sending / receiving of ILA redirect
      messages is enabled.

   By disabling both the ILA mapping expiration time and the sending of
   ILA redirect messages, the host is effectively configured for the
   "push" ILA mapping distribution mode (see Section 8).
   In this mode, BGP (the control plane) is assumed to update/
   synchronize all of the ILA mapping entries in response to the
   identifier move events, and redirect messages are not used.

   The host is expected to receive ILA redirect messages destined to
   its locator and the identifier value of "::1". The source of such a
   message must also use the identifier value of "::1" to be considered
   a redirect message.

6.2.  Providing task isolation

   In the simplest case, the host only needs to implement the ILA
   address rewrite function and inform the tasks starting on the host of
   the ILA addresses they can use. However, it might be desirable to
   provide the tasks with strong networking isolation guarantees, i.e.
   making sure tasks are only allowed to use the IPv6 ILA address they
   have been allocated. For instance, with the Linux operating system,
   this is possible by using the [LINUX-NAMESPACES] and [IPVLAN]
   techniques together.

   Each task running on the host is contained in its own networking
   namespace, and has the allocated ILA address bound to an interface
   that belongs to this namespace. The task is then only able to
   bind to the single IPv6 ILA address delegated to the namespace.

   With the "ipvlan" technique, the packets arriving on the physical
   host's NIC need to have their locator field adjusted before delivery
   to the task (the locator field is set to the /64 prefix assigned to
   the host). No additional routing lookups need to be performed on the
   physical host. On the egress path, all IPv6 lookups and rewrites
   happen in the default namespace, in Linux terminology.
   The figure below demonstrates a host with two tasks running, each in
   its own networking namespace.  The namespace names are "ns0" and
   "ns1", and the corresponding task ILA identifiers are ID0 and ID1.

   +=============================================================+
   |  Host: host1                                                |
   |                                                             |
   |   +----------------------+      +----------------------+    |
   |   | NS:ns0, ID0          |      | NS:ns1, ID1          |    |
   |   |                      |      |                      |    |
   |   |                      |      |                      |    |
   |   |        ipvl0         |      |         ipvl1        |    |
   |   +----------#-----------+      +-----------#----------+    |
   |              #                              #               |
   |              ################################               |
   |                              # eth0                         |
   +==============================#==============================+

           Tasks running in Linux namespaces with ipvlan

   The use of "ipvlan"-like techniques is not strictly necessary.  An
   alternative would be to use the ILA host as a proper IPv6 router and
   treat the attached namespaces as hosts.  This, however, has higher
   performance overhead, due to the multiple forwarding lookups that
   need to be done in the kernel.

6.3.  ILA host operation

   When an ILA host boots up, it joins the control-plane mesh by
   peering with the BGP route-reflectors.  It may learn the active ILA
   mappings from the BGP route reflectors, or may initially keep the
   ILA mapping table empty, depending on whether the "push" or "pull"
   distribution model has been selected.

   When a task starts it will have an ILA identifier allocated, and the
   corresponding IPv6 address (built out of SIR prefix + the allocated
   identifier) bound to an interface within the networking namespace
   created for the task.  The mapping is then propagated over BGP
   peering sessions to all ILA routers.

   For outgoing packets, the ILA host performs the following:

   o  Matches the destination IPv6 address against the SIR prefix.

   o  If the prefix matches, attempts to look up the identifier portion
      of the address in the local ILA mapping table.

   o  If a match is found in the ILA mapping table, rewrites the
      destination address, replacing the SIR prefix with the actual
      locator.

   For packets with destination IPv6 addresses that do not match the SIR
   prefix, usual forwarding rules apply.  If no match is found for the
   SIR address, the packet is sent as is, without the locator portion
   being rewritten (the packet keeps the SIR prefix in place of the
   locator).  It is expected to be delivered to the ILA routers, since
   those advertise the SIR prefix into the routing domain.

   For incoming packets, the ILA host should perform the following:

   o  Match their destination IPv6 addresses against the locator prefix
      (64 bits) of the host.

   o  If the destination address matches, deliver the packet to the
      corresponding namespace, based on the identifier portion.

   o  If the destination identifier in the incoming packet does not
      match any of the ILA mappings, and sending of ILA redirect
      messages is enabled, the host sends an ILA redirect message back
      to the originator of the packet.  The message will have an empty
      locator value, and informs the sender that the mapping it has for
      the identifier is no longer valid, prompting it to erase the
      corresponding entry in its ILA mapping table.

   o  If the source address is a SIR address, the receiving host may
      increase the time-to-live for the corresponding mapping entry, if
      it is present in the ILA mapping table.  This acts as a signal
      confirming liveness of the remote correspondent, and validity of
      the existing mapping.  Otherwise, the mapping would be expired
      based on the time-to-live provided by the original ILA redirect
      message, if ILA mapping expiration is enabled.

   Sending an ILA redirect message by the ILA host requires the host to
   translate the source identifier of the original message.
   Assuming that the flow was likely bi-directional, the entry should
   be readily available in the local ILA mapping table.  If not, the
   ILA redirect message will be routed toward the originator via the
   ILA routers, i.e. sent back with locator equal to the SIR prefix.
   It is possible that both source and destination identifiers of the
   flow have moved, resulting in mutual sending of ILA redirect
   messages, and temporarily falling back to using the ILA routers.

   If the ILA mapping entry expiration time is set to non-zero, the
   unused ILA mapping entries will eventually be deleted.  The entry
   expiration needs to be disabled if the mappings are learned in
   event-driven fashion via the BGP mesh ("push" distribution mode).

7.  Using BGP as the ILA control plane

   This section discusses the use of BGP for ILA mapping information
   dissemination.  The choice of BGP is made to allow for easier
   integration of hardware appliances, e.g. network switches with
   extended functionality, where BGP is commonly used as the control
   plane.  Furthermore, BGP itself offers a simple way of disseminating
   data and converging on a key-value mapping across multiple nodes in
   eventually consistent fashion, and has a proven track record of use
   in the industry.  In addition, the use of BGP allows for leveraging
   the monitoring extensions developed for the protocol.  For example,
   [I-D.ietf-grow-bmp] could be used to observe ILA mapping changes in
   the network using existing tooling.

7.1.  BGP topology

   Per the common practice, a group of BGP route-reflectors (see
   [RFC4456]) should be deployed and peered over IBGP with all ILA hosts
   and ILA routers in the ILA domain.  The reflectors themselves would
   also be peered in full-mesh fashion to provide backup paths for
   mapping information distribution, e.g. in case one of the reflectors
   loses a session to a host.
   Those reflectors do not need to be in the data-path, but merely
   serve the purpose of information distribution.  The number of
   route-reflectors should be at least two, to allow for redundancy.
   See the sections below for a discussion of route-reflection
   settings.

   It is possible to co-locate the BGP route-reflectors with the ILA
   routers.  This saves on having additional nodes for the purpose of
   just BGP route-reflection, but puts extra memory and CPU stress on
   the ILA routers and makes capacity-planning more difficult, and
   therefore is not recommended.

   The route-reflectors are required to peer with potentially a very
   large number of ILA hosts, which may put scaling limits on the size
   of the ILA domain due to the overhead of maintaining a large number
   of BGP peering sessions.  To alleviate this problem, the pool of ILA
   hosts may be split into "shards" and each shard would peer with a
   different group of route-reflectors.  For example, the ILA domain
   may have four groups of route reflectors, each with four
   route-reflectors.  The sixteen route-reflectors may then peer in a
   full-mesh fashion, to exchange the mappings they have received from
   the corresponding "shard" of the ILA domain.  This method avoids the
   issues related to maintaining a large number of TCP sessions, but
   every BGP route-reflector is still required to maintain the full ILA
   mapping table.

   In addition to ILA AFI/SAFIs, other AFI/SAFIs could be configured on
   BGP speakers, e.g. using [I-D.lapukhov-bgp-opaque-signaling] for
   opaque information dissemination in the ILA domain, for instance to
   facilitate distributed address allocation.

7.2.  Any-to-any mapping distribution

   In this mode, the ILA routers could act as IBGP route-reflectors
   [RFC4456] for all of the IBGP sessions they have, and relay the
   mapping information among the ILA hosts.
   This would allow the hosts to avoid initially sending packets to the
   ILA routers, at the expense of maintaining the ILA mapping table.
   Additionally, this allows for completely disabling the ILA redirect
   messages and using only the mapping information propagated by BGP.

7.3.  Hub-and-spoke mapping distribution

   Alternatively, BGP could be used to deliver the mappings from ILA
   hosts to ILA routers only.  The hosts and the routers would establish
   IBGP peering sessions with the route-reflectors in hub-and-spoke
   fashion, with BGP reflectors being the hubs.  The ILA router sessions
   will be configured as the "route-reflector clients" on the
   route-reflectors, while the ILA host sessions will be left as
   ordinary IBGP sessions.  This will propagate all needed mappings to
   the ILA routers and allow them to properly redirect the hosts.  The
   ILA hosts are responsible for withdrawing and announcing the
   mappings as they change.

8.  Push vs pull mapping distribution modes

   The default mode of operation in ILA is "pull" mode, where mappings
   are learned by the ILA hosts via ILA redirect messages.  Effectively,
   the process of populating the ILA mapping table is reactive and
   driven by data-plane events.  In some cases, e.g. upon identifier
   move, this may result in short periods of packet loss, while the
   sender receives the ILA redirect message and falls back to forwarding
   via the ILA routers.  Furthermore, the use of ILA redirect messages
   requires security configuration to avoid message spoofing and cache
   poisoning attacks.

   An alternative to "pull" mapping distribution on the hosts is "push"
   mode, where all ILA hosts receive exactly the same mapping
   information as the ILA routers.  In fact, every ILA host may even
   operate as an ILA router.  In this case, the ILA message sending
   could be disabled in the ILA domain altogether.
   The "push" mode allows for proactive creation of the ILA mappings,
   and avoids the packet loss, provided that the new mapping reaches
   the sending host before the destination identifier has moved.  The
   trade-off here is the overhead of maintaining the full mapping set
   on all ILA hosts.

   For simplicity, this document recommends that all ILA hosts in the
   domain operate in the same mode, either "push" or "pull".  In "push"
   mode the ILA mapping entries expiration needs to be turned off,
   along with sending of ILA messages.  If an ILA host receives a
   packet for an ILA address it cannot map locally, it is expected to
   send an ILA redirect message.  If sending the ILA messages is
   disabled, the host must at least send an ICMPv6 "Destination
   Unreachable" message with code "3" - "Address Unreachable" to aid in
   debugging of the missing mapping.  Notice that the ILA routers
   always operate in "push" mode, i.e. they only learn of mappings via
   the control plane exchange.

9.  ILA address management

   The ILA control plane and redirect messages perform mapping
   information dissemination, but the identifier allocation needs to be
   done separately.  The address management process also depends on
   whether there is some hierarchy desired in the ILA namespace, e.g. if
   allocating a prefix per-tenant is needed.

9.1.  Decentralized address management

   In the simplest case, each ILA host may independently allocate a
   unique identifier per task when it first starts, and the task will
   retain it for the duration of its lifetime (see Appendix A of
   [I-D.herbert-nvo3-ila]).  The chances of collision are very low given
   the 60-bit value of the identifier.  The scheduler is responsible for
   starting and moving the task in the ILA domain.  The tasks belonging
   to the same tenant may discover each other's addresses by some
   out-of-band signaling mechanism, e.g.
   a key-value store such as [MEMCACHED] or [ETCD], or use BGP for the
   same purpose as described in [I-D.lapukhov-bgp-opaque-signaling].
   For instance, the task may publish its own identifier, consisting of
   the tenant name and task name, mapped to the SIR address of the
   task.

   Decentralized allocation is still possible even if the unit of
   address allocation is a prefix, e.g. when multiple tenants are
   sharing the infrastructure, and a unique VNID (see
   [I-D.herbert-nvo3-ila] for definition) is needed per tenant to build
   the 96-bit prefixes allocated to tenants from the /64 SIR prefix.
   Since the size of the VNID space is rather small, generating random
   VNIDs becomes more prone to collision.  In this case, decentralized
   address allocation schemes, such as the one described in [RFC7695],
   could be used.  These techniques require the ILA nodes to have some
   shared communication medium for nodes to "claim" the prefixes and
   avoid collisions.  Once again, various distributed key-value stores
   could be used to accomplish this.

9.2.  Centralized address management

   In the case where a high level of control is needed to allocate the
   addresses, e.g. per-tenant prefixes, centralized address management
   schemes could be used in the ILA domain.  This could be either a
   proprietary address allocation system, or a system built on top of
   protocols such as DHCPv6.

9.3.  Role of Task scheduler

   The ILA domain needs a task scheduler responsible for resource
   allocation and starting of tenants' tasks on the ILA nodes.  Defining
   the functions of such a scheduler is outside the scope of this
   document.  At the very minimum, the scheduler would need agents
   running on every ILA host, participating in ILA address allocation,
   and communicating with the ILA control plane to publish and remove
   the mappings.
   Since it's the scheduler that is responsible for task movements, it
   makes sense for the scheduler to update the mappings in the domain.

   The scheduler needs some kind of API to interact with the BGP process
   on the box.  Defining the exact API is outside the scope of this
   document, but as an option the scheduler may use a BGP session to
   inject prefixes into the BGP process running on the box.

10.  ILA domain federation

   In the default operation mode, each ILA domain is unaware of the
   mappings that exist in other domains.  It is possible to let two
   domains exchange the mapping information and honor the ILA redirect
   messages from one another by "merging" full or partial mapping
   tables of the two domains.  For example, one can envision multiple
   compute clusters, each being its own ILA domain.  In the standard
   ILA model, those clusters would need to communicate via the ILA
   routers only, increasing stress on the data-plane.  To allow traffic
   to flow directly between the hosts in each cluster, bypassing the
   ILA routers, the ILA domains may exchange the mapping information,
   and program the ILA mappings in ILA hosts to facilitate direct
   paths.

   Since each domain may re-use the 64-bit identifier space on its own,
   the use of the SIR prefix is required to make the identifiers
   globally unique.  This requirement is easily fulfilled since the SIR
   prefix is required to be globally routable in the Internet.

   To enable ILA domain federation, the BGP route-reflectors in each
   domain need to be fully meshed and configured to use the "VPN-ILA"
   SAFI with "ILA AFI" (see [I-D.lapukhov-bgp-ila-afi]).  This will
   propagate the mappings known to each route-reflector scoped with the
   SIR prefix of the local domain.
   If multiple domains are federated in this way, intermediate
   route-reflectors could be used, and filtering techniques such as
   those described in [RFC5291] and [RFC4684] could be employed.  The
   filtering may be further used to allow leaking of only select
   mappings, e.g. for the identifiers or tenants that carry lots of
   traffic.

   If the "push" distribution model is chosen with ILA domain
   federation, the ILA hosts will need to be configured to use the
   "VPN-ILA" SAFI on their peering sessions with the BGP route
   reflectors.  The ILA mapping entry lookups then need to be keyed
   both on the SIR prefix and the identifier to be resolved.  Given the
   large volume of mappings that may exist in the federated model, the
   "pull" model might become preferable.

11.  Operational Considerations

   ILA introduces an additional step in packet routing and thus adds
   more complexity to the network troubleshooting process.  At the same
   time, relative to the virtualization techniques that employ
   encapsulation and tunneling, ILA makes the underlying physical
   network fully visible to the tasks, and thus makes tenant-driven
   troubleshooting simpler.  This section discusses some operational
   procedures specific to ILA and the additional fault models that are
   possible in the presence of ILA.

11.1.  Operational procedures for ILA routers

   ILA routers may be added to or removed from the network at any time.
   Adding a router is commonly needed to scale the capacity of the ILA
   router group when peak loads increase.  Adding an ILA router is a
   non-disruptive procedure.  It starts by configuring the ILA router
   to peer with the BGP mesh to learn of all mappings in the domain.
   The use of BGP graceful restart (see [RFC4724]) would allow the new
   router to learn when all mappings have been advertised.  At this
   time, the router may inject the SIR prefix, joining the operational
   group of ILA routers, and start forwarding ILA traffic.

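   As an illustration only, the join procedure for a new ILA router can
   be sketched in Python.  The class and method names, and the rule of
   waiting for an End-of-RIB marker from every configured route
   reflector before injecting the SIR prefix, are assumptions made for
   this sketch and are not defined by this document.  The End-of-RIB
   marker itself is the [RFC4724] mechanism that signals completion of
   the initial routing update.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class IlaRouterJoin:
    """Hypothetical model of an ILA router joining the anycast group."""

    reflectors: frozenset[str]   # route reflectors this router peers with
    mappings: dict[int, int] = field(default_factory=dict)  # identifier -> locator
    eor_seen: set[str] = field(default_factory=set)  # reflectors done sending
    announcing_sir: bool = False  # True once the SIR prefix is injected

    def on_mapping(self, identifier: int, locator: int) -> None:
        """Record an ILA mapping learned over a BGP session."""
        self.mappings[identifier] = locator

    def on_end_of_rib(self, reflector: str) -> None:
        """Note an End-of-RIB marker; inject the SIR prefix once all arrived."""
        if reflector in self.reflectors:
            self.eor_seen.add(reflector)
        # Assumption: attract traffic only after every reflector has
        # finished advertising its mappings.
        if self.reflectors and self.eor_seen == set(self.reflectors):
            self.announcing_sir = True


router = IlaRouterJoin(reflectors=frozenset({"rr1", "rr2"}))
router.on_mapping(0x1, 0x20010DB800000001)
router.on_end_of_rib("rr1")  # still learning, SIR prefix not announced
router.on_end_of_rib("rr2")  # all reflectors done, SIR prefix announced
```

   Gating the SIR prefix announcement on the full table keeps the new
   router from attracting anycast traffic for identifiers it cannot
   yet map.
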
   To gracefully take the ILA router out of service, it may be
   instructed to stop announcing the SIR prefix, or, in case of BGP,
   announce it with less preferable path attributes.  This will allow
   the router to still accept and forward all in-flight packets, but
   will redirect the remaining packets toward the remaining ILA routers.

11.2.  ICMPv6 Message generation by transit devices

   Under some conditions, transit ILA-unaware devices may need to
   generate ICMPv6 messages, e.g. when the IPv6 hop limit is exceeded.
   The source of the packet sent by an ILA application would have SIR
   as the prefix, and hence the ICMPv6 message will need to transit an
   ILA router before getting back to the host that sent the original
   packet.  This has some operational downside, as it adds path stretch
   to the control message flow, and needs to be accounted for in
   operations.

   When an ICMPv6 message generated by an intermediate device arrives
   back at the sender of the original packet, the ILA host may need to
   translate the payload of the ICMPv6 message, as it often contains
   the IPv6 header of the original packet.  This is needed so that the
   control message can be properly correlated to the transport level
   connection.  Thus, it is expected that the ILA host stack will be
   able to perform this translation, and replace the ILA locator with
   the SIR prefix in the destination address field of the encapsulated
   IPv6 header.

   The last case is the generation of an ICMPv6 message by a transit
   device for a packet sourced by a non-ILA host (or from outside of
   the local ILA domain) and translated by an ILA router.  In this
   case, the response will be directed back to the non-ILA host,
   bypassing the ILA router, and there will be no easy way to perform
   the translation of the locator portion in the ILA destination
   address back to the SIR prefix.  The non-ILA sender would be able to
   process the ICMPv6 message.

11.3.  Multicast routing

   Defining multicast routing and group membership dissemination is
   outside the scope of this document.

11.4.  Potential ILA mapping table complications

   Every packet egressing from an ILA host and matching the SIR prefix
   is subject to lookup and translation in the local ILA mapping table.
   If an entry is not found, the packet is forwarded to the ILA
   router(s) by virtue of the SIR prefix injected in the datacenter
   network.  If the ILA router does not have the mapping, either the
   ICMPv6 "Destination Unreachable" or "ILA mapping not found" message
   will be sent back, depending on whether the original sender is an
   ILA or non-ILA host.  There are a few observations to make here:

   o  Packets egressing the ILA host and not matching the SIR prefix are
      routed as usual.

   o  ILA destinations that are not yet present in the ILA mapping table
      will be initially routed toward the ILA routers (e.g. the ILA
      routers will show up in the initial "traceroute" command output).

   o  In case of a missing identifier mapping, it's the ILA router that
      informs the sender of this event via either an "ILA Mapping not
      Found" or ICMPv6 "Destination Unreachable" message.

   Thus, the case of a missing mapping is easily debuggable, though the
   "transition period" when the mapping is not yet in the ILA mapping
   table might confuse the operator using the "traceroute" command.

   The most difficult case of ILA mapping table malfunction would be
   the presence of an incorrect mapping, i.e. a mapping pointing to a
   non-existent or incorrect locator.

   o  Non-existent locator.  This will route the packet through the
      network, and eventually result either in the packet getting
      discarded due to a missing route or IPv6 NDP entry, or the packet
      being dropped due to a routing loop and hop-limit expiration.
      In either case, the original sender may detect this condition
      either via reception of ICMPv6 "Destination Unreachable"
      messages, or by observing the output of the "traceroute" command.
      The ILA host may also be configured to make sure the identifiers
      fall within the known prefix range.

   o  Incorrect locator.  In this case, the packet will be delivered to
      the wrong ILA host, which does not have the mapping for the
      identifier.  Depending on whether the sending of ILA redirect
      messages is enabled on the host, two scenarios are possible:

      *  The destination ILA host sends back an ILA redirect message
         with an empty locator, informing the sender that the mapping
         is invalid.  The sender will invalidate the ILA mapping entry
         and switch over to forwarding via the ILA routers.  The latter
         will either inform it of the new mapping, or send an ICMPv6
         "Destination Unreachable" message back.

      *  The destination ILA host is not configured to send the ILA
         redirect messages back.  In this case, it simply responds with
         ICMPv6 "Destination Unreachable" messages for as long as the
         sender keeps sending packets using the incorrect mapping.  The
         mapping needs to be flushed or updated by some external means.

   The next possible failure is dropped ILA redirect messages.  However,
   given that the ILA redirect message sending process is memoryless,
   the recipient will eventually receive one of them, or at least finish
   the communication via an ILA router.

11.5.  Potential ILA routers complications

   The ILA routers serve as proxies for traffic entering the ILA domain,
   as well as temporary transit hops for traffic between the ILA hosts
   when they don't have matching mappings, in case the "pull"
   distribution model is utilized.
   The following operational observations apply:

   o  Traffic between the ILA domain and the external world will
      necessarily flow asymmetrically.  The packets toward the ILA
      hosts sent from the outside will always cross the ILA routers
      (see Section 10 for exceptions from this case) and traffic
      returning from the ILA hosts to the external world will flow
      directly, bypassing the ILA routers.  This will show up in the
      outputs of the "traceroute" command run from the sender and the
      destination as asymmetric paths.  That being said, asymmetric
      traffic flows are very common in modern networks, and thus this
      should not be a problem on its own.

   o  A failure of an ILA router should be handled by re-balancing the
      load automatically by means of ECMP re-hashing in the network,
      and therefore should be mostly transparent to the ILA hosts,
      unless the load increases significantly after the failure.  It is
      possible to have a cascading failure and lose all ILA routers, or
      have them over-utilized.  This event should be detected by an
      external monitoring system, and be acted upon by adding more ILA
      routers to the domain - either automatically or manually.  From
      the troubleshooting perspective, the event will manifest itself
      via massive packet loss toward all hosts in the ILA domain.

   o  A malfunction of a single ILA router (e.g. a network interface
      card issue) would manifest itself in somewhat increased packet
      drop ratios for flows crossing the ILA routers, mostly traffic
      from external nodes.  The more ILA routers the domain has, the
      harder this ratio would be to notice, since ECMP mostly spreads
      traffic evenly over all the ILA routers.  This problem is more
      specific to ECMP behavior, and tooling exists to deal with it in
      datacenter networks.

   o  ILA routers are in the path of the ICMPv6 messages generated by
      non-ILA aware routers in the network.
      Thus, a loss of such a packet in the network cannot be
      differentiated from a loss due to a drop by an ILA router.  This
      may potentially complicate network troubleshooting efforts.

   To sum up, the health of the ILA routers is critical to the ILA
   domain functions, even if the "push" model is employed and the ILA
   routers are used mostly for external communications.  The ILA routers
   should be monitored closely for vital parameters, such as CPU and
   memory utilization, traffic rates on their network interfaces, and
   packet loss toward the ILA routers themselves.

12.  Deployment Scenario Primer

   Building upon the concepts presented above, this section provides a
   simple ILA deployment scenario.

   o  For locator addressing, unique-local addresses are used, with 16
      bits available for sub-allocation.  This allows for 1024 (2^10)
      Tier-3 switches with 64 (2^6) servers under each Tier-3 switch.
      Using the Clos topology from Section 4.1 one can build 32
      clusters with 32 Tier-3 switches each.

   o  The hosts in the network would use BGP to peer with Tier-3
      switches and inject their locator prefixes.  It's desirable, but
      not necessary, to configure route summarization on the network
      switches, depending on the size of the deployment.

   o  Given the small to moderate scale of deployment, four IBGP
      route-reflectors would be deployed in the ILA domain, without the
      need for an extra level of aggregation hierarchy.  Each
      route-reflector will need to be configured to accept the BGP
      sessions from all of the ILA hosts and be able to maintain
      thousands of peering sessions.

   o  The ILA hosts and routers should be configured with a single SIR
      prefix, and set up for the "push" mapping distribution model, by
      disabling sending the ILA redirect messages.  All ILA mappings
      will be propagated to all hosts and ILA routers via BGP.
      Each ILA host and router will need to be running a BGP process
      and peer with all four route-reflectors.

   o  The ILA routers will inject the SIR prefix using BGP into the
      network.

   o  For tasks running on ILA hosts, the globally unique ILA
      identifiers should be allocated independently in pseudo-random
      fashion by the host that first starts the task.

   o  As a task is moved, the task scheduler will update the mapping and
      publish it via BGP, forcing the ILA routers and ILA hosts to
      update their ILA mapping tables.

   o  ILA domain federation is not used, so ILA domains communicate
      with each other via the ILA routers only.

13.  IANA Considerations

   None

14.  Manageability Considerations

   ILA requires both one-time deployment efforts, and recurring
   management work.  The initial involvement is reasonably high, as it
   requires extending the existing network and host configuration.  It
   does not require any significant changes to the existing
   applications, though, aside from making the applications use newly
   allocated IPv6 addresses.  The majority of the required changes
   could be done without any disruption to the existing infrastructure.

   ILA address management schemes could be arbitrarily complex, but in
   the most basic form do not require any centralized coordination.
   Thus, in many cases it could be a simple local subroutine that
   generates a pseudo-random identifier.

   Recurring management efforts are mostly concentrated on monitoring
   the components of the ILA deployment, primarily the ILA routers and
   the BGP route reflectors.  Troubleshooting these components follows
   the standard process and uses regular tooling, with the caveat of
   having more logical components to deal with, primarily the ILA
   routers and the ILA mapping tables on the ILA hosts.
   This increases the complexity of the troubleshooting process, as
   more state needs to be inspected and validated.

15.  Security Considerations

   ILA introduces new security considerations described below.

15.1.  ILA host security

   If unsecured ILA redirect messages are used, the ILA hosts could be
   exposed to cache poisoning attacks.  This calls for ILA redirect
   message authentication, e.g. by use of digital signatures, such as
   [ED25519].  This will also require some mechanism for propagation of
   the public keys associated with the SIR prefix (the ILA routers) and
   every locator in the domain, since the ILA redirect message could be
   sent by either.

   To prevent tasks from ever being able to send packets directly,
   bypassing the mapping layer, the ILA hosts should prohibit the task
   from sending packets toward the address space associated with the
   locators.  Given that all locators will likely belong to one large
   prefix, this could be accomplished by installing a single filtering
   rule on the ILA host.

15.2.  BGP Security

   Standard means of improving BGP security as described in [RFC7454]
   could be applied to harden the mapping dissemination system.  Among
   them, the most important one is likely to be the "TCP Authentication
   Option" described in the referenced document.  Notice that the BGP
   subsystem used to distribute the ILA mappings is not as vulnerable as
   the Internet BGP mesh, since it only works within the boundaries of a
   privately managed data-center.

15.3.  ILA router security

   ILA routers are primarily susceptible to various forms of rate-based
   DDoS attacks.  The primary concern would be overrunning the
   capabilities of ILA routers with too many packets sent from non-ILA
   hosts toward the SIR addresses, or the "thundering herd" problem
   when ILA translation tables on the ILA hosts expire synchronously,
   or due to a poisoning attack.
Primary ways to address this concern would be closely 1229 monitoring server utilization and potentially rate-limiting packet 1230 flow to the ILA router on the upstream network device (ToR switch). 1232 15.4. Tenant security 1234 ILA does not natively isolate the tenant traffic from each other, nor 1235 from the underlying physical infrastructure. In fact, this is seen 1236 as one benefit that makes many troubleshooting processes easier. The 1237 access control then become responsibility of the tenant itself, by 1238 employing traffic filtering rules. To this point, implementing 1239 filtering rules gets simpler if the tenant is allocated single 1240 prefix, as opposed to each task getting an unique identifier. 1242 16. Acknowledgements 1244 TBD 1246 17. Informative References 1248 [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A 1249 Border Gateway Protocol 4 (BGP-4)", RFC 4271, 1250 DOI 10.17487/RFC4271, January 2006, 1251 . 1253 [RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route 1254 Reflection: An Alternative to Full Mesh Internal BGP 1255 (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006, 1256 . 1258 [RFC4684] Marques, P., Bonica, R., Fang, L., Martini, L., Raszuk, 1259 R., Patel, K., and J. Guichard, "Constrained Route 1260 Distribution for Border Gateway Protocol/MultiProtocol 1261 Label Switching (BGP/MPLS) Internet Protocol (IP) Virtual 1262 Private Networks (VPNs)", RFC 4684, DOI 10.17487/RFC4684, 1263 November 2006, . 1265 [RFC5291] Chen, E. and Y. Rekhter, "Outbound Route Filtering 1266 Capability for BGP-4", RFC 5291, DOI 10.17487/RFC5291, 1267 August 2008, . 1269 [RFC6740] Atkinson, RJ. and SN. Bhatti, "Identifier-Locator Network 1270 Protocol (ILNP) Architectural Description", RFC 6740, 1271 DOI 10.17487/RFC6740, November 2012, 1272 . 1274 [RFC2791] Yu, J., "Scalable Routing Design Principles", RFC 2791, 1275 DOI 10.17487/RFC2791, July 2000, 1276 . 1278 [RFC3633] Troan, O. and R. 
              Droms, "IPv6 Prefix Options for Dynamic Host Configuration
              Protocol (DHCP) version 6", RFC 3633,
              DOI 10.17487/RFC3633, December 2003.

   [RFC4724]  Sangli, S., Chen, E., Fernando, R., Scudder, J., and Y.
              Rekhter, "Graceful Restart Mechanism for BGP", RFC 4724,
              DOI 10.17487/RFC4724, January 2007.

   [RFC4760]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
              "Multiprotocol Extensions for BGP-4", RFC 4760,
              DOI 10.17487/RFC4760, January 2007.

   [RFC4786]  Abley, J. and K. Lindqvist, "Operation of Anycast
              Services", BCP 126, RFC 4786, DOI 10.17487/RFC4786,
              December 2006.

   [RFC6769]  Raszuk, R., Heitz, J., Lo, A., Zhang, L., and X. Xu,
              "Simple Virtual Aggregation (S-VA)", RFC 6769,
              DOI 10.17487/RFC6769, October 2012.

   [RFC6830]  Farinacci, D., Fuller, V., Meyer, D., and D. Lewis, "The
              Locator/ID Separation Protocol (LISP)", RFC 6830,
              DOI 10.17487/RFC6830, January 2013.

   [RFC7454]  Durand, J., Pepelnjak, I., and G. Doering, "BGP Operations
              and Security", BCP 194, RFC 7454, DOI 10.17487/RFC7454,
              February 2015.

   [RFC7695]  Pfister, P., Paterson, B., and J. Arkko, "Distributed
              Prefix Assignment Algorithm", RFC 7695,
              DOI 10.17487/RFC7695, November 2015.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016.

   [I-D.herbert-nvo3-ila]
              Herbert, T., "Identifier-locator addressing for IPv6",
              draft-herbert-nvo3-ila-03 (work in progress), October
              2016.

   [I-D.lapukhov-bgp-opaque-signaling]
              Lapukhov, P., Aries, E., Marques, P., and E. Nkposong,
              "Use of BGP for Opaque Signaling", draft-lapukhov-bgp-
              opaque-signaling-02 (work in progress), April 2016.

   [I-D.ietf-v6ops-dc-ipv6]
              Lopez, D., Chen, Z., Tsou, T., Zhou, C., and A.
              Servin, "IPv6 Operational Guidelines for Datacenters",
              draft-ietf-v6ops-dc-ipv6-01 (work in progress), February
              2014.

   [I-D.lapukhov-bgp-ila-afi]
              Lapukhov, P., "Use of BGP for dissemination of ILA mapping
              information", draft-lapukhov-bgp-ila-afi-01 (work in
              progress), March 2016.

   [I-D.ietf-grow-bmp]
              Scudder, J., Fernando, R., and S. Stuart, "BGP Monitoring
              Protocol", draft-ietf-grow-bmp-17 (work in progress),
              January 2016.

   [I-D.ietf-nvo3-arch]
              Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T.
              Narten, "An Architecture for Data Center Network
              Virtualization Overlays (NVO3)", draft-ietf-nvo3-arch-08
              (work in progress), September 2016.

   [ED25519]  "Ed25519: high-speed high-security signatures".

   [ETCD]     "coreos/etcd".

   [MEMCACHED]
              "Memcached".

   [ROUTED-DESIGN]
              "High Availability Campus Network Design", 2008.

   [LINUX-NAMESPACES]
              "Namespaces in operation, part 1: namespaces overview",
              2013.

   [IPVLAN]   "IPVLAN Driver HOWTO", 2013.

Author's Address

   Petr Lapukhov
   Facebook
   1 Hacker Way
   Menlo Park, CA 94025
   US

   Email: petr@fb.com