idnits 2.17.1 

draft-lapukhov-ila-deployment-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([I-D.herbert-nvo3-ila]), which
     it shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (October 31, 2016) is 2734 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Unused Reference: 'RFC4760' is defined on line 1288, but no explicit
     reference was found in the text

  -- Obsolete informational reference (is this intentional?): RFC 3633
     (Obsoleted by RFC 8415)

  -- Obsolete informational reference (is this intentional?): RFC 6830
     (Obsoleted by RFC 9300, RFC 9301)

  == Outdated reference: A later version (-04) exists of
     draft-herbert-nvo3-ila-03

  == Outdated reference: A later version (-04) exists of
     draft-lapukhov-bgp-opaque-signaling-02

  == Outdated reference: A later version (-02) exists of
     draft-lapukhov-bgp-ila-afi-01


     Summary: 1 error (**), 0 flaws (~~), 5 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                        P. Lapukhov
3	Internet-Draft                                                  Facebook
4	Intended status: Informational                          October 31, 2016
5	Expires: May 4, 2017

7	  Deploying Identifier-Locator Addressing (ILA) in datacenter networks
8	                    draft-lapukhov-ila-deployment-01

10	Abstract

12	   The Identifier-Locator Addressing (ILA) architecture defined in
13	   [I-D.herbert-nvo3-ila] proposes a locator-identifier split in the
14	   IPv6 address to realize workload mobility and more efficient use
15	   of network resources.  This document describes how ILA can be
16	   implemented in a datacenter using BGP as the control-plane protocol.
17	   ILA could be built using different control planes, and BGP is one
18	   particular instantiation.  BGP was chosen because it is a well-known
19	   protocol that is sufficient for small to medium size deployments, on
20	   the scale of a few million identifier-to-locator mappings.  Defining
21	   more generic and scalable control plane variants is outside the
22	   scope of this document.

24	Status of This Memo

26	   This Internet-Draft is submitted in full conformance with the
27	   provisions of BCP 78 and BCP 79.

29	   Internet-Drafts are working documents of the Internet Engineering
30	   Task Force (IETF).  Note that other groups may also distribute
31	   working documents as Internet-Drafts.  The list of current Internet-
32	   Drafts is at http://datatracker.ietf.org/drafts/current/.

34	   Internet-Drafts are draft documents valid for a maximum of six months
35	   and may be updated, replaced, or obsoleted by other documents at any
36	   time.  It is inappropriate to use Internet-Drafts as reference
37	   material or to cite them other than as "work in progress."

39	   This Internet-Draft will expire on May 4, 2017.

41	Copyright Notice

43	   Copyright (c) 2016 IETF Trust and the persons identified as the
44	   document authors.  All rights reserved.

46	   This document is subject to BCP 78 and the IETF Trust's Legal
47	   Provisions Relating to IETF Documents
48	   (http://trustee.ietf.org/license-info) in effect on the date of
49	   publication of this document.  Please review these documents
50	   carefully, as they describe your rights and restrictions with respect
51	   to this document.  Code Components extracted from this document must
52	   include Simplified BSD License text as described in Section 4.e of
53	   the Trust Legal Provisions and are provided without warranty as
54	   described in the Simplified BSD License.

56	Table of Contents

58	   1.  Introduction
59	   2.  Terminology
60	   3.  ILA deployment process
61	   4.  Preparing the network
62	     4.1.  Data-center network topology
63	     4.2.  Configuring locator addressing
64	   5.  Deploying ILA routers
65	     5.1.  ILA Redirect Message
66	     5.2.  Configuration parameters
67	     5.3.  ILA router operation
68	     5.4.  Scaling considerations
69	   6.  Deploying ILA hosts
70	     6.1.  Configuration parameters
71	     6.2.  Providing task isolation
72	     6.3.  ILA host operation
73	   7.  Using BGP as the ILA control plane
74	     7.1.  BGP topology
75	     7.2.  Any-to-any mapping distribution
76	     7.3.  Hub-and-spoke mapping distribution
77	   8.  Push vs pull mapping distribution modes
78	   9.  ILA address management
79	     9.1.  Decentralized address management
80	     9.2.  Centralized address management
81	     9.3.  Role of Task scheduler
82	   10. ILA domain federation
83	   11. Operational Considerations
84	     11.1.  Operational procedures for ILA routers
85	     11.2.  ICMPv6 Message generation by transit devices
86	     11.3.  Multicast routing
87	     11.4.  Potential ILA mapping table complications
88	     11.5.  Potential ILA routers complications
89	   12. Deployment Scenario Primer
90	   13. IANA Considerations
91	   14. Manageability Considerations
92	   15. Security Considerations
93	     15.1.  ILA host security
94	     15.2.  BGP Security
95	     15.3.  ILA router security
96	     15.4.  Tenant security

98	   16. Acknowledgements
99	   17. Informative References
100	   Author's Address

102	1.  Introduction

104	   This document provides high-level guidelines for building an ILA-
105	   enabled datacenter using BGP [RFC4271] as the protocol for ILA
106	   mapping information dissemination.  The reader is expected to be
107	   familiar with the principles presented in [I-D.herbert-nvo3-ila].
108	   Reading about the ILNP architecture defined in [RFC6740] is also
109	   recommended, but not required to understand this document.
110	   While ILA does not implement the original ILNP proposal, it is based
111	   on the same idea of maintaining the Identifier vs Locator split in
112	   the IPv6 address.

114	   ILA benefits from routed datacenter networks, i.e. networks that do
115	   not rely on spanning Layer-2 domains across multiple network devices.
116	   Endpoint mobility made possible by ILA is one of the key benefits ILA
117	   brings to the datacenter networks.  Combining ILA with fully routed
118	   network design allows for achieving the robustness of routed network
119	   with the flexibility of endpoint mobility.  Some practical
120	   recommendations for building a fully-routed datacenter network can
121	   be found in [RFC7938] or [ROUTED-DESIGN].

123	   Though workload mobility could also be achieved in L3 switched
124	   networks by using the "host-route injection" technique, this approach
125	   has limited applicability due to the high stress it puts on the
126	   underlying control and data planes.  The mobile prefix needs to be
127	   removed, re-injected and propagated to all network devices every time
128	   an address moves.

130	   ILA is an alternative to "encapsulation" approaches, such as LISP
131	   ([RFC6830]), for realizing the endpoint mobility and network
132	   virtualization.  Using simple address rewrites significantly reduces
133	   the processing overhead on the hosts, and makes various hardware and
134	   software network acceleration functions easier to implement (e.g.
135	   checksum computation offload).  Furthermore, ILA keeps the underlying
136	   network fully visible to the applications that use ILA addresses,
137	   which makes network troubleshooting easier, as compared to the
138	   "encapsulation" approaches.

140	2.  Terminology

142	   This section defines ILA-specific terminology that will be used
143	   throughout the document.

145	      ILA domain: a collection of ILA hosts and ILA routers that
146	      collectively support ILA identifier mobility and network
147	      virtualization model.  The ILA domain is assigned a single 64-bit
148	      IPv6 prefix known as SIR (Standard Identifier Representation, see
149	      [I-D.herbert-nvo3-ila]) prefix, which is made known to all hosts
150	      and routers in the domain.  This prefix is used to construct the
151	      complete 128-bit IPv6 addresses for ILA identifiers found in the
152	      domain.

154	      SIR Address: IPv6 address constructed from SIR prefix concatenated
155	      with the 64-bit identifier.  This is the address visible to the
156	      applications and transport layer on ILA hosts.

158	      ILA Address: IPv6 address constructed from an actual valid 64-bit
159	      locator and a 64-bit identifier.  This address is what is seen by
160	      transit network devices; it is expected to be routable in the
161	      underlying network.

163	      ILA mapping table: The table for mapping identifiers to locators
164	      present in ILA host or ILA router.  This table is updated either
165	      via BGP, or ILA redirect messages.  ILA routers maintain full
166	      authoritative copy of the table, while ILA hosts may have their
167	      own smaller view of the global mapping state.

169	      ILA host: network endpoint that is capable of accepting and
170	      originating packets with ILA addresses, by performing stateless
171	      rewrite between SIR addresses and ILA addresses.  The host
172	      maintains its own local version of the ILA mapping table and has
173	      at least one ILA locator (64-bit prefix) assigned.

175	      Non-ILA host: network endpoint that is not aware of ILA addressing
176	      structure and does not participate in ILA address translations.
177	      To this host, the SIR and ILA addresses look like regular IPv6
178	      addresses.

180	      ILA router: network endpoint that is responsible for two main
181	      functions:

183	      *  Storing and disseminating the authoritative ILA mapping
184	         information within the ILA domain (NVA role per
185	         [I-D.ietf-nvo3-arch]).

187	      *  Serving as the gateway between the ILA-hosts and non-ILA hosts,
188	         as well as the gateway for communicating with other ILA domains
189	         (NVE role per [I-D.ietf-nvo3-arch]).

191	      Task: the unit of mobility in ILA domain.  Each task is assigned
192	      an identifier unique within the ILA domain, which follows the task
193	      as it changes the hosts and, consequently, the locators.
194	      Implementation wise, the task can run within a container or a
195	      virtual machine, for example.

197	      Tenant: owner of the tasks executed in the shared environment.

199	      Common Locator Address (CLA): Special ILA address constructed as
200	      <locator>::1 and identifying the physical host itself.  This
201	      address is used to send and receive the ILA redirect messages.
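
   The relationship between the SIR address, the ILA address and the
   CLA defined above can be illustrated with a short Python sketch.
   This is only an illustration under stated assumptions: the function
   names and the example prefixes are hypothetical, and any checksum
   handling or identifier typing defined by the ILA specification is
   deliberately left out.  Only the 64-bit prefix swap is shown.

```python
import ipaddress

MASK64 = (1 << 64) - 1  # selects the lower 64 bits (the identifier)


def _upper64(prefix):
    """Return the upper 64 bits of a /64 prefix as an integer."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("expected a /64 prefix")
    return int(net.network_address) >> 64


def sir_to_ila(sir_address, locator):
    """Replace the SIR prefix of an address with a 64-bit locator."""
    identifier = int(ipaddress.IPv6Address(sir_address)) & MASK64
    return ipaddress.IPv6Address((_upper64(locator) << 64) | identifier)


def ila_to_sir(ila_address, sir_prefix):
    """Replace the locator of an ILA address with the SIR prefix."""
    identifier = int(ipaddress.IPv6Address(ila_address)) & MASK64
    return ipaddress.IPv6Address((_upper64(sir_prefix) << 64) | identifier)


def common_locator_address(locator):
    """Build the CLA, i.e. <locator>::1, for a host locator."""
    return ipaddress.IPv6Address((_upper64(locator) << 64) | 1)
```

   In this sketch the identifier is never modified; only the upper 64
   bits change, which is why transit devices see an ordinary routable
   IPv6 address while applications keep seeing the SIR address.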

203	3.  ILA deployment process

205	   The ILA domain consists of the following conceptual elements:

207	   o  Routed network that provides reachability among physical hosts,
208	      i.e. provides routing within the locator address space.

210	   o  ILA hosts, each assigned a unique /64 prefix reachable within the
211	      network.  ILA hosts maintain their own local version of ILA
212	      mapping table.

214	   o  ILA routers, each injecting the domain's SIR prefix into the
215	      routed network and maintaining the full mapping table for the ILA
216	      domain.  The routers could be implemented in software, or using
217	      specialized hardware appliances.

219	   o  Centralized BGP route-reflector nodes that peer with all of the
220	      ILA hosts and all of the ILA routers within the domain for the
221	      purpose of mapping information dissemination.  ILA hosts and
222	      routers run the BGP processes to communicate with the reflectors.

224	   Deploying ILA in a datacenter requires the following logical steps:

226	   o  Preparing the network.  Assigning locator addressing to the hosts
227	      (servers) in the network and providing routed interconnection
228	      among the locator prefixes.

230	   o  Configuring ILA hosts and ILA routers.  Each ILA domain requires a
231	      set of ILA routers to facilitate mapping function and provide
232	      connectivity to other ILA domains and the Internet.  Each ILA
233	      domain is assigned a /64 SIR prefix, which scopes all identifiers
234	      in the domain.  All ILA hosts and ILA routers within a domain are
235	      aware of the SIR prefix of this domain.

237	   o  Enabling the ILA control plane.  Configuring the BGP mesh for
238	      mapping information dissemination within the ILA domain and
239	      injecting the SIR prefix into routed network from the ILA routers
240	      to facilitate communications among the ILA domains and from / to
241	      the Internet.  See [I-D.lapukhov-bgp-ila-afi] for definition of
242	      the corresponding BGP extension.

244	   o  Deploying an address management solution to coordinate allocation
245	      of ILA identifiers.  In the simplest cases, the addresses could be
246	      generated on each host individually, without central coordination.

248	4.  Preparing the network

250	   This section provides an overview of the network-related
251	   configuration needed for ILA.

253	4.1.  Data-center network topology

255	   For ease of reference, this document adopts the Clos topology
256	   described in [RFC7938] along with the terminology developed in that
257	   document.

259	                                      Tier-1
260	                                     +-----+
261	          Cluster                    |     |
262	 +----------------------------+   +--|     |--+
263	 |                            |   |  +-----+  |
264	 |                    Tier-2  |   |           |   Tier-2
265	 |                   +-----+  |   |  +-----+  |  +-----+
266	 |     +-------------| DEV |------+--|     |--+--|     |-------------+
267	 |     |       +-----|  C  |------+  |     |  +--|     |-----+       |
268	 |     |       |     +-----+  |      +-----+     +-----+     |       |
269	 |     |       |              |                              |       |
270	 |     |       |     +-----+  |      +-----+     +-----+     |       |
271	 |     | +-----------| DEV |------+  |     |  +--|     |-----------+ |
272	 |     | |     | +---|  D  |------+--|     |--+--|     |---+ |     | |
273	 |     | |     | |   +-----+  |   |  +-----+  |  +-----+   | |     | |
274	 |     | |     | |            |   |           |            | |     | |
275	 |   +-----+ +-----+          |   |  +-----+  |          +-----+ +-----+
276	 |   | DEV | | DEV |          |   +--|     |--+          |     | |     |
277	 |   |  A  | |  B  | Tier-3   |      |     |      Tier-3 |     | |     |
278	 |   +-----+ +-----+          |      +-----+             +-----+ +-----+
279	 |     | |     | |            |                            | |     | |
280	 |     O O     O O            |                            O O     O O
281	 |       Servers              |                              Servers
282	 +----------------------------+

284	                      Figure 1: 5-Stage Clos topology

286	   The network is partitioned hierarchically in three tiers, with tier
287	   numbering starting at the "middle" stage of the Clos network.  The
288	   "middle" tier is often called the "spine" of the network.

290	   A set of directly connected Tier-2 and Tier-3 devices along with
291	   their attached servers will be referred to as a "cluster".

293	   Tier-3 switches that connect the servers are often referred to as
294	   "ToR" (Top of Rack) switches or simply "rack switches".

296	4.2.  Configuring locator addressing

298	   A mandatory prerequisite for ILA deployment is enabling IPv6 routing
299	   in the network.  This could be done using either a dual-stack
300	   IPv4/IPv6 deployment or an IPv6-only deployment.  This document
301	   assumes the network has already been configured to forward IPv6
302	   traffic.  See [I-D.ietf-v6ops-dc-ipv6] for operational considerations
303	   on deploying IPv6 in the datacenter.

305	   ILA requires every ILA host to have at least one 64-bit locator
306	   assigned.  This means that every host (server) in the datacenter
307	   network needs to have at least one /64 IPv6 prefix configured on one
308	   of its interfaces.  These /64 prefixes could be either globally
309	   routable or unique-local.

311	   The use of the globally routable addressing scheme allows for
312	   deploying a highly scalable hierarchical addressing scheme, and makes
313	   the locators accessible from the Internet.  The figure below
314	   illustrates the structure of a globally-routable locator:

316	 |<------------------ Locator -------------------->|
317	 |3 bits| N bits     | M1 bits | M2 bits | M3 bits |       64 bits
318	 +------+------------+---------+---------+---------+-------------------+
319	 | 001  | Global pfx | Cluster |   Rack  |   Host  |    Identifier     |
320	 +------+------------+---------+---------+---------+-------------------+
321	 |<-------------------- 64-bits ------------------>|

323	   For example, a global /32 prefix (N=29) allows for sub-allocation of
324	   2^32 locators.  This sub-allocation could be done hierarchically,
325	   mapping to the tiers of network topology.  Following the /32 example
326	   prefix:

328	      Allocate 256 /64 prefixes per Tier-3 switch (M3 = 8 bits), which
329	      allows for up to 256 physical hosts in a rack, with /56 prefix
330	      assigned per rack.

332	      Assuming 256 Tier-3 switches per cluster, one would allocate /48
333	      per cluster (M2 = 8 bits).

335	      This leaves 16 bits (64K clusters) per datacenter (M1 = 16
336	      bits).  This space could be further sub-divided if multiple Clos
337	      network fabrics have been deployed.

339	   The use of unique-local addressing for locators is more limiting in
340	   terms of available space, as it only offers 16-bits for sub-
341	   allocation.  It does, however, have the benefit of ad-hoc allocation.
342	   This could work better for smaller deployments, e.g. allocating
343	   10 bits to enumerate Tier-3 switches (physical racks of servers) and
344	   6 bits to enumerate hosts within a rack.  For instance, the address
345	   structure may look as follows, with M1 = 10 bits and M2 = 6 bits.

347	 |<----------------- Locator --------------->|
348	 | 7 bits |1|  40 bits   | M1 bits | M2 bits |          64 bits        |
349	 +--------+-+------------+---------+---------+-------------------------+
350	 | FC00   |L| Global ID  |  Rack   |   Host  |        Identifier       |
351	 +--------+-+------------+---------+---------+-------------------------+
352	 |                       |<---- 16 bits ---->|
353	 |<--------------- 64-bits ----------------->|
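
   The two locator layouts above can be sketched in Python as a simple
   bit-packing exercise.  This is a minimal illustration, not a
   recommended allocator: the helper name make_locator, the
   documentation prefix 2001:db8::/32 and the ULA prefix
   fd00:1234:5678::/48 are assumptions chosen for the example.

```python
import ipaddress


def make_locator(base_prefix, fields):
    """Pack (value, width) fields below base_prefix into a /64 locator.

    fields are listed from most significant to least significant,
    e.g. [(cluster, 16), (rack, 8), (host, 8)] for the /32 example.
    """
    base = ipaddress.IPv6Network(base_prefix)
    if base.prefixlen + sum(width for _, width in fields) != 64:
        raise ValueError("field widths must extend the prefix to /64")

    suffix = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit its width")
        suffix = (suffix << width) | value

    upper = (int(base.network_address) >> 64) | suffix
    return ipaddress.IPv6Network((upper << 64, 64))


# Globally routable /32: cluster 1 (M1 = 16), rack 2 (M2 = 8),
# host 3 (M3 = 8).
global_locator = make_locator("2001:db8::/32", [(1, 16), (2, 8), (3, 8)])

# Unique-local /48: rack 5 (M1 = 10 bits), host 7 (M2 = 6 bits).
ula_locator = make_locator("fd00:1234:5678::/48", [(5, 10), (7, 6)])
```

   With these inputs the globally routable locator is
   2001:db8:1:203::/64, which falls inside the rack prefix
   2001:db8:1:200::/56 and the cluster prefix 2001:db8:1::/48, and the
   unique-local locator is fd00:1234:5678:147::/64.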

355	   In either case, the addressing scheme is hierarchical, allowing for
356	   simple route summarization logic and better routing system scaling
357	   (see [RFC2791]).  This is especially important in case of IPv6, since
358	   contemporary datacenter network switches often have smaller IPv6
359	   lookup tables as compared to IPv4.  Route summarization also requires
360	   certain network design changes to avoid packet black-holing under
361	   link failures.  This problem gets more complicated in Clos
362	   topologies, and is analyzed in more detail in [RFC7938].

364	   In greenfield deployments, each ILA host could be assigned a /64
365	   locator prefix during the provisioning phase.  There are multiple
366	   options to accomplish this:

368	   o  Assigning static link-local addresses to servers and statically
369	      routing /64 prefixes from Tier-3 switches to the servers over
370	      those link-local addresses.  In this model, the operator would
371	      plan and pre-allocate per ILA-host prefixes beforehand, and
372	      configure the Tier-3 switches accordingly.  From an operational
373	      risk perspective, if the server is not present while the static
374	      route is configured on the Tier-3 switch, packets destined to the
375	      corresponding /64 prefix will cause the switch to continuously
376	      generate IPv6 NDP packets ("gleaning"), which puts extra stress on
377	      the device's CPU.

379	   o  The servers may request the /64 prefix using IPv6 Prefix
380	      Delegation mechanism as defined in [RFC3633].  This allocation
381	      could be made "permanent" by proper DHCPv6 server configuration
382	      and ensuring the same prefix is always being delegated to the same
383	      server.  The Tier-3 switch would act as a DHCPv6 relay and
384	      install the corresponding /64 IPv6 route dynamically.  This
385	      approach addresses both the allocation and the routing problem,
386	      but makes the setup potentially more fragile operationally
387	      (reliance on additional protocol) and harder to debug (additional
388	      process involved).

390	   o  The server may run a routing daemon (e.g. a BGP process) and
391	      inject the pre-allocated /64 prefix into the Tier-3 switch.  The address
392	      allocation in this case needs to happen by some other means.  This
393	      is more suitable for ad-hoc ILA testing and small, rapid
394	      deployments.

396	   The server itself may use one of the IPv6 addresses in /64 prefix for
397	   its own addressing, e.g. for remote access or management purposes.
398	   Alternatively, the server may obtain another IPv6 address from a
399	   different (non-locator) IPv6 address range allocated for the
400	   datacenter.  This document recommends using <locator>::1 as the
401	   special identifier, naming it the "Common Locator Address" (CLA).
402	   This choice of identifier makes it easy to differentiate from regular
403	   identifiers.  This identifier could be used for connectivity testing.

405	   Route summarization for the locator prefixes is highly desirable to
406	   reduce the stress on the network switches' forwarding tables and
407	   improve control-plane stability, and needs to be implemented at least
408	   on Tier-3 switches.  In the simplest case, the switches could be
409	   statically preconfigured with the summary routes.  These routes need
410	   to agree with the prefixes that are assigned to the servers,
411	   especially in the case when dynamic prefix injection is used.  As a
412	   possible alternative, simple virtual aggregation could be employed,
413	   where hosts inject both the specific and the summary route, and
414	   installation of corresponding FIB entries is suppressed as per the
415	   rules defined in [RFC6769].  The latter approach does not improve the
416	   control plane scalability, but solves the issues with packet black-
417	   holing in presence of network summarization.  It also requires
418	   network hardware support, which may not be present.
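
   The requirement that statically configured summary routes agree
   with the prefixes assigned to the servers can be checked
   mechanically.  The sketch below uses Python's standard ipaddress
   module; the helper names and the example prefixes are hypothetical
   and only illustrate the idea (Python 3.7 or later is assumed for
   subnet_of).

```python
import ipaddress


def summarize(host_prefixes):
    """Collapse host /64 prefixes into the smallest set of summaries."""
    nets = [ipaddress.IPv6Network(p) for p in host_prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


def uncovered_prefixes(summary, host_prefixes):
    """Return the host prefixes that the summary route does not cover."""
    summary_net = ipaddress.IPv6Network(summary)
    return [
        p for p in host_prefixes
        if not ipaddress.IPv6Network(p).subnet_of(summary_net)
    ]


# A rack holding all 256 host prefixes collapses to a single /56.
rack = ipaddress.IPv6Network("2001:db8:1:200::/56")
rack_hosts = [str(n) for n in rack.subnets(new_prefix=64)]
rack_summary = summarize(rack_hosts)
```

   A non-empty result from uncovered_prefixes indicates a server prefix
   that would not be reachable through the configured summary.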

420	   In retrofitting scenarios, the servers are likely to already have
421	   128-bit IPv6 addresses assigned, allocated from the datacenter
422	   address space, e.g. by using a single /64 prefix per Tier-3 switch.
423	   In this case, the additional locator prefix needs to be assigned in
424	   the same way as described above for greenfield deployments.  The only
425	   difference is that the new prefix and the old server address may be
426	   allocated from different IPv6 address ranges.

428	5.  Deploying ILA routers

430	   ILA routers perform multiple functions within the ILA domain:

432	   o  Serve as the centralized store of the identifier-to-locator
433	      mapping information in the domain.  The mappings are delivered to
434	      the ILA routers as described in Section 7.

436	   o  Act as the gateway between the ILA hosts and non-ILA capable
437	      hosts, e.g. the Internet.

439	   ILA hosts initially send packets destined to identifiers for which
440	   they have no mapping to the ILA routers, which perform the ILA
441	   translation.  Hosts outside of the ILA domain use the ILA routers
442	   for all communications with the domain.  The ILA routers
443	   may also act as ILA hosts and have one or more identifiers assigned.

445	5.1.  ILA Redirect Message

447	   ILA routers may originate and ILA hosts must receive and process ILA
448	   redirect messages.  The ILA redirect message is carried in a UDP
449	   packet destined to a well-known port.  It carries the information
450	   binding an identifier to its locator.  For security purposes, this
451	   message is expected to be authenticated by cryptographic means, such
452	   as by using a keyed HMAC (message authentication code) procedure.
453	   Every host in the domain is then required to be configured with the
454	   key information to be able to validate the redirect messages.

456	   The ILA redirect message might be signed with multiple HMAC keys to
457	   facilitate key transition in the domain.  The redirect message will
458	   carry multiple signatures along with the corresponding numeric key
459	   identifiers, and the ILA hosts are expected to use the signature with
460	   the highest locally known identifier.  As the old key leaves
461	   rotation, eventually every host will get updated and the signature
462	   made using the old key could be removed.
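
   The key transition procedure above can be sketched as follows.  The
   message layout is not defined in this section, so the payload bytes,
   the use of SHA-256 and the dictionary keyed by numeric key
   identifier are assumptions made purely for illustration.

```python
import hashlib
import hmac


def sign_redirect(payload, keys):
    """Sign the payload with every active key.

    keys maps a numeric key identifier to the shared secret (bytes).
    Returns a dict of key identifier -> HMAC-SHA256 signature.
    """
    return {
        key_id: hmac.new(secret, payload, hashlib.sha256).digest()
        for key_id, secret in keys.items()
    }


def verify_redirect(payload, signatures, local_keys):
    """Validate using the highest key identifier known locally."""
    known = [key_id for key_id in signatures if key_id in local_keys]
    if not known:
        return False  # no usable signature, reject the redirect
    key_id = max(known)
    expected = hmac.new(local_keys[key_id], payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signatures[key_id])
```

   During a transition the router signs with both the old and the new
   key.  A host that has only the old key still validates the message,
   while an updated host prefers the new key; once every host is
   updated, the old signature can be dropped.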

464	5.2.  Configuration parameters

466	   The ILA routers need the following configured for their operation:

468	   o  Regular, non-anycast 128-bit IPv6 address to connect the ILA
469	      router to the datacenter network.

471	   o  Cryptographic material to authenticate ILA redirect messages, for
472	      example a key to be used with an HMAC scheme.

474	   o  The /64 SIR prefix for the ILA domain, shared by all ILA routers.
475	      This prefix is advertised into the network in anycast fashion and
476	      "intercepts" all traffic destined from hosts outside of ILA
477	      domains to the SIR addresses in the domain.  The prefix could be
478	      injected in "always-on" fashion, e.g. by using BGP injectors on
479	      ILA routers.  This couples the ILA router's life-cycle with the
480	      prefix injection cycle.

482	   o  Control-plane configuration, i.e. the IPv6 addresses of BGP route
483	      reflectors, and possibly some configuration for the local BGP
484	      process.  This is discussed in more detail in Section 7.

486	   o  Management settings, such as maximum rate of ILA redirect
487	      messages, and associated security attributes (e.g. the key pair
488	      used for message signing).

490	   o  A configuration flag that instructs the router whether the ILA
491	      redirect messages need to be sent out.  The ILA router does not
492	      receive ILA redirect messages, since by design it knows of all
493	      active mappings in the domain.

495	5.3.  ILA router operation

497	   Upon booting, the ILA router is first required to join the control
498	   plane mesh and learn of the mappings that exist in the ILA domain.
499	   It is also aware of the SIR prefix that is used within its domain.
500	   After the router has learned of the mappings, it may inject the
501	   anycast SIR prefix in the datacenter network and join the operational
502	   group of ILA routers.

504	   Just like any ILA node, the ILA router is required to have a 64-bit
505	   locator configured.  The special identifier ::1 is used to build the
506	   source and destination addresses of the ILA redirect messages.

508	   When an ILA router receives a packet, it compares the upper 64 bits
509	   of the destination IPv6 address with its configured SIR prefix and
510	   performs the following:

512	   o  If the destination address does not match the SIR prefix, the ILA
513	      router discards the packet, as it is not supposed to be received
514	      by the ILA router.

516	   o  Attempts to resolve the source identifier (bottom 64-bits of the
517	      source address), if applicable.  If the source address matches SIR
518	      prefix, it is coming from an ILA host.  The router then needs to
519	      translate the identifier found in the source address to its
520	      locator.  If the translation fails, send back the ILA "Mapping Not
521	      Found" message.  If the source address does not match the SIR
522	      prefix, then no translation is needed, and no redirect messages
523	      need to be sent back.

525	   o  Attempts to find the locator matching for the destination
526	      identifier (the bottom 64-bits of the destination IPv6 address).
527	      If the mapping for destination identifier is not found, the
528	      original packet is dropped, and an ICMPv6 "Destination
529	      Unreachable" message (ICMPv6 type 1) is sent back to the message
530	      originator.  Otherwise, the router does the following:

532	      *  Rewrites the SIR prefix in the destination IPv6 address with
533	         the new locator and forwards the packet back to the network.

535	      *  If sending of ILA redirect messages is permitted, the router
536	         sends the ILA redirect message back to the originator of the
537	         packet, by looking up the source identifier and finding the
538	         corresponding locator.  The redirect informs the source of the
539	         actual destination locator.  The redirect messages must be
540	         rate-limited to avoid sending an ILA redirect for every incoming
541	         IPv6 packet.

543	      *  As mentioned previously, the source and the destination ILA
544	         addresses of the redirect message IPv6 header use the
545	         identifier value "::1", which designates them for delivery to
546	         the ILA control process.

548	   If the source IPv6 address check reveals that the packet is not
549	   coming from the ILA domain the router belongs to (i.e. the SIR prefix
550	   does not match), the ILA router does not need to send back the ILA
551	   redirect message, and instead simply continues to forward the packet,
552	   provided that the locator for the destination identifier is found.
553	   The ILA router will still send the ICMPv6 "Destination Unreachable"
554	   message for unknown mappings.
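
   The router-side processing described above can be sketched roughly
   as follows.  This is an illustrative Python model, not a reference
   implementation: the IlaRouter class, the split() and join() helpers,
   and the string labels used for control messages are hypothetical
   names introduced here.  Rate limiting of redirects is omitted, and
   the sketch assumes that forwarding continues after a failed source
   lookup, which the text does not spell out.

```python
import ipaddress
from dataclasses import dataclass, field

MASK64 = (1 << 64) - 1


def split(addr):
    """Return (upper 64 bits, lower 64 bits) of an IPv6 address string."""
    value = int(ipaddress.IPv6Address(addr))
    return value >> 64, value & MASK64


def join(upper, identifier):
    """Build an IPv6 address string from a 64-bit prefix and identifier."""
    return str(ipaddress.IPv6Address((upper << 64) | identifier))


@dataclass
class IlaRouter:
    sir_prefix: int                               # upper 64 bits of the SIR prefix
    mappings: dict = field(default_factory=dict)  # identifier -> locator (upper 64 bits)
    redirects_enabled: bool = True

    def process(self, src, dst):
        """Return (action, rewritten destination or None, control messages)."""
        src_prefix, src_id = split(src)
        dst_prefix, dst_id = split(dst)
        if dst_prefix != self.sir_prefix:
            return ("discard", None, [])          # not meant for the ILA router
        from_ila_host = src_prefix == self.sir_prefix
        messages = []
        if from_ila_host and src_id not in self.mappings:
            messages.append("mapping-not-found")  # source identifier unknown
        locator = self.mappings.get(dst_id)
        if locator is None:
            messages.append("icmpv6-destination-unreachable")
            return ("drop", None, messages)
        if from_ila_host and self.redirects_enabled and src_id in self.mappings:
            messages.append("ila-redirect")       # rate limiting not modelled
        return ("forward", join(locator, dst_id), messages)
```

   For example, with identifiers 0x10 and 0x20 mapped to two different
   locators, a packet between the two SIR addresses is forwarded with
   the destination prefix rewritten and a redirect queued, while a
   packet from a non-SIR source is forwarded without a redirect.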

556	5.4.  Scaling considerations

558	   Due to high load and reliability concerns, the ILA domain needs
559	   multiple ILA routers.  The simplest way to provide redundancy is by
560	   letting the ILA routers inject the /64 SIR IPv6 prefix into the
561	   datacenter network in anycast fashion ([RFC4786]).  This allows the
562	   datacenter network's Equal-Cost Multipath (ECMP) capabilities to
563	   distribute traffic among the ILA routers naturally.

565	   For redundancy purposes, the ILA routers would need to be spread
566	   across multiple physical racks in the datacenter.  More ILA routers
567	   could be added incrementally to reduce the load and scale capacity
568	   horizontally, and join the operational ILA group in non-disruptive
569	   fashion, after they have learned the full mapping table for the ILA
570	   domain.

572	   Use of the anycast method does have some routing implications.
573	   For example, using the network described in Section 4.1 will result
574	   in ILA hosts preferring to use the ILA routers in the same cluster,
575	   since those are closer based on the routing metric.  Thus, the
576	   network may not evenly spread their packets across all ILA routers in
577	   the datacenter.  It is therefore possible that some ILA routers will
578	   receive more traffic than the others.  This issue is inherent to
579	   anycast routing in general and is not specific to ILA.

581	6.  Deploying ILA hosts

583	   This section reviews the deployment considerations for the ILA hosts.

585	6.1.  Configuration parameters

587	   The ILA hosts need to be configured with the following:

589	   o  SIR prefix of the ILA domain.

591	   o  IPv6 addresses of the BGP route reflectors.

593	   o  The routable /64 locator assigned to the host.

595	   o  ILA mapping entries expiration time, to time out unused entries.

597	   o  Cryptographic material to allow validation of redirect messages.

599	   o  Boolean flag, defining whether ILA redirection messages sending /
600	      receiving is enabled.

602	   By disabling both the ILA mapping expiration time and the sending of
603	   ILA redirect messages, the host is effectively configured for the
604	   "push" ILA mapping distribution mode (see Section 8).
605	   In this mode, the BGP (control plane) is assumed to update/
606	   synchronize all of the ILA mapping entries in response to the
607	   identifier move events, and redirect messages are not used.

609	   The host is expected to receive ILA redirect messages destined to
610	   its locator and the identifier value of "::1".  The source of such a
611	   message must also use the identifier value of "::1" to be considered
612	   a redirect message.
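
   As an illustration only, the parameters listed above and the "::1"
   addressing check could be modelled as shown below.  The class name,
   field names, and the convention that an expiration time of 0 means
   "disabled" are assumptions made for this sketch.  The check covers
   addressing only and does not perform cryptographic validation.

```python
import ipaddress
from dataclasses import dataclass

MASK64 = (1 << 64) - 1
CONTROL_IDENTIFIER = 1  # identifier value "::1"


@dataclass
class IlaHostConfig:
    sir_prefix: str            # SIR prefix of the ILA domain
    route_reflectors: list     # IPv6 addresses of the BGP route reflectors
    locator: str               # routable /64 locator assigned to the host
    mapping_expiration_s: int  # 0 disables expiration (assumed encoding)
    redirect_key: bytes        # material for validating redirect messages
    redirects_enabled: bool    # sending / receiving of redirects

    def is_push_mode(self):
        """Both expiration and redirects disabled: "push" distribution."""
        return self.mapping_expiration_s == 0 and not self.redirects_enabled

    def looks_like_redirect(self, src, dst):
        """Addressing check only: host locator plus "::1" on both ends."""
        dst_addr = ipaddress.IPv6Address(dst)
        if dst_addr not in ipaddress.IPv6Network(self.locator):
            return False
        src_id = int(ipaddress.IPv6Address(src)) & MASK64
        dst_id = int(dst_addr) & MASK64
        return src_id == CONTROL_IDENTIFIER and dst_id == CONTROL_IDENTIFIER
```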

614	6.2.  Providing task isolation

616	   In the simplest case, the host only needs to implement the ILA
617	   address rewrite function and inform the tasks starting on the host of
618	   the ILA addresses they can use.  However, it might be desirable to
619	   provide the tasks with strong networking isolation guarantees, i.e.
620	   making sure tasks are only allowed to use the IPv6 ILA address they
621	   have been allocated.  For instance, with the Linux operating system,
622	   this is possible by using the [LINUX-NAMESPACES] and [IPVLAN]
623	   techniques together.

625	   Each task running on the host is confined to its own networking
626	   namespace, and has the allocated ILA address bound to an interface
627	   that belongs to this namespace.  The task would then only be able to
628	   bind to the single IPv6 ILA address delegated to the namespace.

630	   With the "ipvlan" technique, the packets arriving on the physical
631	   host's NIC need to have their locator field adjusted before delivery
632	   to the task (the locator field is set to the /64 prefix assigned to
633	   the host).  No additional routing lookups need to be performed on
634	   the physical host.  On the egress path, all IPv6 lookups and rewrites
635	   happen in the default namespace, in Linux terminology.  The figure
636	   below demonstrates a host with two tasks running, each in its own
637	   networking namespace.  The namespace names are "ns0" and "ns1", and
638	   the corresponding task ILA identifiers are ID0 and ID1.

640	   +=============================================================+
641	   |  Host: host1                                                |
642	   |                                                             |
643	   |   +----------------------+      +----------------------+    |
644	   |   |   NS:ns0, ID0        |      |  NS:ns1, ID1         |    |
645	   |   |                      |      |                      |    |
646	   |   |                      |      |                      |    |
647	   |   |        ipvl0         |      |         ipvl1        |    |
648	   |   +----------#-----------+      +-----------#----------+    |
649	   |              #                              #               |
650	   |              ################################               |
651	   |                              # eth0                         |
652	   +==============================#==============================+

654	               Tasks running in Linux namespaces with ipvlan

656	   The use of "ipvlan"-like techniques is not strictly necessary.  An
657	   alternative would be to use the ILA host as a proper IPv6 router and
658	   treat the attached namespaces as hosts.  This, however, has higher
659	   performance overhead, due to multiple forwarding lookups that need to
660	   be done in the kernel.

662	6.3.  ILA host operation

664	   When an ILA host boots up, it joins the control-plane mesh by peering
665	   with the BGP route-reflectors.  It may learn the active ILA mappings
666	   from the BGP route reflectors, or may initially keep the ILA mapping
667	   table empty, depending on whether the "push" or "pull" distribution
668	   model has been selected.

670	   When a task starts, it will have an ILA identifier allocated, and the
671	   corresponding IPv6 address (built out of SIR prefix + the allocated
672	   identifier) bound to an interface within the networking namespace
673	   created for the task.  The mapping is then propagated over BGP
674	   peering sessions to all ILA routers.

676	   For outgoing packets, the ILA host performs the following:

678	   o  Matches the destination IPv6 address against the SIR prefix.

680	   o  If the prefix matches, attempts to look up the identifier portion
681	      of the address in the local ILA mapping table.

683	   o  If a match is found in the ILA mapping table, rewrites the
684	      destination address by replacing the SIR prefix with the locator.

686	   For packets with destination IPv6 addresses that do not match the SIR
687	   prefix, usual forwarding rules apply.  If no mapping is found for a
688	   SIR address, the packet is sent as is, with the SIR prefix still in
689	   place of the locator.  It is expected to be delivered to the ILA
690	   routers, since those advertise the SIR prefix into the routing
691	   domain.
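
   The egress steps above amount to a prefix match followed by a table
   lookup.  A minimal sketch using Python's standard ipaddress module is
   shown below; the function name and the table layout (identifier
   mapped to a /64 locator string) are assumptions made for
   illustration.

```python
import ipaddress

MASK64 = (1 << 64) - 1


def rewrite_egress(dst, sir_prefix, mappings):
    """Return the destination address to put on the wire.

    mappings: identifier (int) -> locator as a /64 prefix string.
    """
    addr = ipaddress.IPv6Address(dst)
    if addr not in ipaddress.IPv6Network(sir_prefix):
        return str(addr)                 # usual forwarding rules apply
    identifier = int(addr) & MASK64
    locator = mappings.get(identifier)
    if locator is None:
        return str(addr)                 # sent as is, reaches the ILA routers
    base = int(ipaddress.IPv6Network(locator).network_address)
    return str(ipaddress.IPv6Address(base | identifier))
```

   A packet to an unmapped SIR address leaves the host unchanged and
   follows the anycast route to an ILA router.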

693	   For incoming packets, the ILA host should perform the following:

695	   o  Match their destination IPv6 addresses against the locator prefix
696	      (64 bits) of the host.

698	   o  If the destination address matches, deliver the packet to the
699	      corresponding namespace, based on the identifier portion.

701	   o  If the destination identifier in the incoming packet does not
702	      match any of the ILA mappings, and sending of ILA redirect message
703	      is enabled, the host sends an ILA redirect message back to the
704	      originator of the packet.  The message will have an empty locator
705	      value, and informs the sender that the mapping it has for the
706	      identifier is no longer valid, prompting it to erase the
707	      corresponding entry in the sender's ILA mapping table.

709	   o  If the source address is a SIR address, the receiving host may
710	      increase time-to-live for the corresponding mapping entry, if it
711	      is present in the ILA mapping table.  This acts as a signal
712	      confirming liveness of the remote correspondent, and validity of
713	      the existing mapping.  Otherwise, the mapping would be expired
714	      based on the time-to-live provided by the original ILA redirect
715	      message, if ILA mapping expiration is enabled.
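
   The ingress handling above could be modelled as follows.  This is a
   sketch under stated assumptions: the class and attribute names are
   hypothetical, the returned strings merely label the resulting
   action, and the ICMPv6 fallback used when redirects are disabled is
   the behavior described in Section 8.

```python
import ipaddress

MASK64 = (1 << 64) - 1


class IlaHostIngress:
    def __init__(self, sir_prefix, locator, local_ids, redirects_enabled, ttl_s):
        self.sir = ipaddress.IPv6Network(sir_prefix)
        self.locator = ipaddress.IPv6Network(locator)
        self.local_ids = local_ids          # identifier -> namespace name
        self.redirects_enabled = redirects_enabled
        self.ttl_s = ttl_s
        self.mapping_expiry = {}            # remote identifier -> expiry time

    def receive(self, src, dst, now):
        dst_addr = ipaddress.IPv6Address(dst)
        if dst_addr not in self.locator:
            return "not-for-this-locator"
        src_addr = ipaddress.IPv6Address(src)
        src_id = int(src_addr) & MASK64
        # Traffic from a known SIR source confirms the mapping is alive.
        if src_addr in self.sir and src_id in self.mapping_expiry:
            self.mapping_expiry[src_id] = now + self.ttl_s
        namespace = self.local_ids.get(int(dst_addr) & MASK64)
        if namespace is not None:
            return "deliver:" + namespace
        if self.redirects_enabled:
            return "redirect-empty-locator"
        return "icmpv6-address-unreachable"
```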

717	   Sending an ILA redirect message by the ILA host requires the host to
718	   translate the source identifier of the original message.  Assuming
719	   that the flow was likely bi-directional, the entry should be readily
720	   available in the local ILA mapping table.  If not, the ILA redirect
721	   message will be routed toward the originator via the ILA routers,
722	   i.e. sent back with locator equal to the SIR prefix.  It is possible
723	   that both source and destination identifiers of the flow have moved,
724	   resulting in mutual sending of ILA redirect messages, and temporarily
725	   falling back to using the ILA routers.

727	   If the ILA mapping entry expiration time is set to a non-zero value,
728	   the unused ILA mapping entries will eventually be deleted.  Entry
729	   expiration needs to be disabled if the mappings are learned in an
730	   event-driven fashion via the BGP mesh ("push" distribution mode).

732	7.  Using BGP as the ILA control plane

734	   This section discusses the use of BGP for ILA mapping information
735	   dissemination.  The choice of BGP is made to allow for easier
736	   integration of hardware appliances, e.g. network switches with
737	   extended functionality, where BGP is commonly used as the control
738	   plane.  Furthermore, BGP itself offers a simple way of disseminating
739	   data and converging on a key-value mapping across multiple nodes in
740	   an eventually consistent fashion, and has a proven track record of
741	   use in the industry.  In addition, use of BGP allows for leveraging
742	   the monitoring extensions developed for the protocol.  For example,
743	   [I-D.ietf-grow-bmp] could be used to observe ILA mapping changes in
744	   the network using existing tooling.

746	7.1.  BGP topology

748	   Per the common practice, a group of BGP route-reflectors (see
749	   [RFC4456]) should be deployed and peered over IBGP with all ILA hosts
750	   and ILA routers in the ILA domain.  The reflectors themselves would
751	   also be peered in full-mesh fashion to provide backup paths for
752	   mapping information distribution, e.g. in case one of the reflectors
753	   loses a session to a host.  Those reflectors do not need to be in the
754	   data-path, but merely serve the purpose of information
755	   distribution.  The number of route-reflectors should be at least two,
756	   to allow for redundancy.  See the sections below for discussion of
757	   route-reflection settings.

759	   It is possible to co-locate the BGP route-reflectors with the ILA
760	   routers.  This saves on having additional nodes for the purpose of
761	   just BGP route-reflection, but puts extra memory and CPU stress on
762	   the ILA routers, and makes capacity planning more difficult.
763	   Co-locating the route-reflectors with the ILA routers is therefore
764	   not recommended.

766	   The route-reflectors are required to peer with potentially a very
767	   large number of ILA hosts, which may put scaling limits on the size
768	   of the ILA domain due to the overhead of maintaining a large number
769	   of BGP peering sessions.  To alleviate this problem, the pool of ILA
770	   hosts may be split into "shards" and each shard would peer with a
771	   different group of route-reflectors.  For example, the ILA domain may
772	   have four groups of route reflectors, each with four route-
773	   reflectors.  The sixteen route-reflectors may then peer in a full-
774	   mesh fashion, to exchange the mappings they have received from the
775	   corresponding "shard" of the ILA domain.  This method avoids the
776	   issues related to maintaining a large number of TCP sessions, but
777	   every BGP route-reflector is still required to maintain the full ILA
778	   mapping table.
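
   To make the effect of sharding concrete, the per-reflector session
   count can be estimated as below.  The function name is hypothetical,
   the figure of 40,000 hosts in the usage note is an arbitrary example,
   and the sketch assumes that every host in a shard peers with every
   route-reflector of its group and that hosts are split evenly.

```python
def sessions_per_reflector(num_hosts, groups, reflectors_per_group):
    """Estimate BGP sessions terminated on a single route-reflector."""
    total_reflectors = groups * reflectors_per_group
    hosts_per_shard = -(-num_hosts // groups)   # ceiling division
    return {
        "host_sessions": hosts_per_shard,       # hosts of one shard only
        "mesh_sessions": total_reflectors - 1,  # full mesh of reflectors
    }
```

   With four groups of four route-reflectors and 40,000 hosts, each
   reflector terminates 10,000 host sessions and 15 mesh sessions,
   instead of 40,000 host sessions without sharding.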

780	   In addition to the ILA AFI/SAFIs, other AFI/SAFIs could be configured
781	   on BGP speakers, e.g. using [I-D.lapukhov-bgp-opaque-signaling] for
782	   opaque information dissemination in the ILA domain, for instance to
783	   facilitate distributed address allocation.

785	7.2.  Any-to-any mapping distribution

787	   In this mode, the ILA routers could act as IBGP route-reflectors
788	   [RFC4456] for all of the IBGP sessions they have, and relay the
789	   mapping information among the ILA hosts.  This would allow the hosts
790	   to avoid initially sending packets to the ILA routers, at the expense
791	   of maintaining the ILA mapping table.  Additionally, this allows for
792	   completely disabling the ILA redirect messages and using only the
793	   mapping information propagated by BGP.

795	7.3.  Hub-and-spoke mapping distribution

797	   Alternatively, BGP could be used to deliver the mappings from ILA
798	   hosts to ILA routers only.  The hosts and the routers would establish
799	   IBGP peering sessions with the route-reflectors in hub-and-spoke
800	   fashion, with BGP reflectors being the hubs.  The ILA router sessions
801	   will be configured as the "route-reflector clients" on the route-
802	   reflectors, while the ILA host sessions will be left as ordinary
803	   IBGP sessions.  This will propagate all needed mappings to the ILA
804	   routers and allow them to properly redirect the hosts.  The ILA hosts
805	   are responsible for withdrawing and announcing the mappings as they
806	   change.

808	8.  Push vs pull mapping distribution modes

810	   The default mode of operation in ILA is "pull" mode, where mappings
811	   are learned by the ILA hosts via ILA redirect messages.  Effectively,
812	   the process of populating the ILA mapping table is reactive and
813	   driven by data-plane events.  In some cases, e.g. upon identifier
814	   move, this may result in short periods of packet loss, while the
815	   sender receives the ILA redirect message and falls back to forwarding
816	   via the ILA routers.  Furthermore, the use of ILA redirect messages
817	   requires security configuration to avoid message spoofing and cache
818	   poisoning attacks.

820	   An alternative to "pull" mapping distribution on the hosts is "push"
821	   mode, where all ILA hosts receive exactly the same mapping
822	   information as the ILA routers.  In fact, every ILA host may even
823	   operate as an ILA router.  In this case, sending of ILA redirect
824	   messages could be disabled in the ILA domain altogether.  The "push"
825	   mode allows for proactive creation of the ILA mappings, and avoids
826	   packet loss, provided that the new mapping reaches the sending host
827	   before the destination identifier has moved.  The trade-off here is
828	   the overhead of maintaining the full mapping set on all ILA hosts.

830	   For simplicity, this document recommends that all ILA hosts in the
831	   domain operate in the same mode, either "push" or "pull".  In "push"
832	   mode the ILA mapping entry expiration needs to be turned off, along
833	   with sending of ILA messages.  If an ILA host receives a packet for
834	   an ILA address it cannot map locally, it is expected to send an ILA
835	   redirect message.  If sending the ILA messages is disabled, the host
836	   must at least send an ICMPv6 "Destination Unreachable" message with
837	   code "3" - "Address Unreachable" to aid in debugging of missing
838	   mappings.  Notice that the ILA routers always operate in
839	   "push" mode, i.e. they only learn of mappings via the control plane
840	   exchange.

842	9.  ILA address management

844	   The ILA control plane and redirect messages perform mapping
845	   information dissemination, but the identifier allocation needs to be
846	   done separately.  The address management process also depends on
847	   whether there is some hierarchy desired in the ILA namespace, e.g. if
848	   allocating a prefix per-tenant is needed.

850	9.1.  Decentralized address management

852	   In the simplest case, each ILA host may independently allocate a
853	   unique identifier per task when it first starts, and the task will
854	   retain it for the duration of its lifetime (see Appendix A of
855	   [I-D.herbert-nvo3-ila]).  The chances of collision are very low given
856	   the 60-bit value of the identifier.  The scheduler is responsible for
857	   starting and moving the task in the ILA domain.  The tasks belonging
858	   to the same tenant may discover each other's addresses by some out-
859	   of-band signaling mechanism, e.g. a key-value store such as
860	   [MEMCACHED] or [ETCD], or use BGP for the same purpose as described
861	   in [I-D.lapukhov-bgp-opaque-signaling].  For instance, the task may
862	   publish its own identifier, consisting of the tenant name and task
863	   name, mapped to the SIR address of the task.
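
   The claim that collisions are unlikely can be checked with the
   standard birthday approximation, p = 1 - exp(-n(n-1)/(2N)), where n
   is the number of allocated identifiers and N = 2^60.  The sketch
   below is illustrative only: the function names are hypothetical, and
   drawing the identifier from Python's secrets module is an assumption
   rather than something this document prescribes.

```python
import math
import secrets


def allocate_identifier(in_use, bits=60):
    """Pick a random identifier not already used on this host."""
    while True:
        candidate = secrets.randbits(bits)
        if candidate not in in_use:   # only local knowledge is checked
            in_use.add(candidate)
            return candidate


def collision_probability(n, bits=60):
    """Birthday approximation for n random values in a 2**bits space."""
    return -math.expm1(-n * (n - 1) / (2 * 2**bits))
```

   For one million tasks in the domain, collision_probability(10**6)
   evaluates to roughly 4.3e-7.  The same function shows how quickly
   the probability grows when the random space is much smaller.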

865	   Decentralized allocation is still possible even if the unit of
866	   address allocation is a prefix, e.g. when multiple tenants are
867	   sharing the infrastructure, and a unique VNID (see
868	   [I-D.herbert-nvo3-ila] for the definition) is needed per tenant to
869	   build the 96-bit prefixes allocated to tenants from the /64 SIR
870	   prefix.  Since the size of the VNID space is rather small, generating
871	   random VNIDs becomes more prone to collision.  In this case,
872	   decentralized address allocation schemes, such as the one described
873	   in [RFC7695], could be used.  These techniques require the ILA nodes
874	   to have some shared communication medium for nodes to "claim" the
875	   prefixes and avoid collisions.  Once again, various distributed
876	   key-value stores could be used to accomplish this.

878	9.2.  Centralized address management

880	   In the case where a high level of control is needed to allocate the
881	   addresses, e.g. per-tenant prefixes, centralized address management
882	   schemes could be used in the ILA domain.  This could be either a
883	   proprietary address allocation system, or a system built on top of
884	   protocols such as DHCPv6.

886	9.3.  Role of Task scheduler

888	   The ILA domain needs a task scheduler responsible for resource
889	   allocation and starting of tenants' tasks on the ILA nodes.  Defining
890	   functions of such a scheduler is outside the scope of this document.
891	   At the very minimum, the scheduler would need agents running on every
892	   ILA host, participating in ILA address allocation, and communicating
893	   with the ILA control plane to publish and remove the mappings.  Since
894	   it's the scheduler that is responsible for task movements, it makes
895	   sense for the scheduler to update the mappings in the domain.

897	   The scheduler needs some kind of API to interact with the BGP process
898	   on the box.  Defining the exact API is outside the scope of this
899	   document, but as an option the scheduler may use a BGP session to
900	   inject prefixes into the BGP process running on the box.

902	10.  ILA domain federation

904	   In the default operation mode, each ILA domain is unaware of the
905	   mappings that exist in another domain.  It is possible to let two
906	   domains exchange the mapping information and honor the ILA
907	   redirect messages from another domain by "merging" full or partial
908	   mapping tables of the two domains.  For example, one can envision
909	   multiple compute clusters, each being its own ILA domain.  In the
910	   standard ILA model, those clusters would need to communicate via the
911	   ILA routers only, increasing stress on the data-plane.  To allow
912	   traffic to flow directly between the hosts in each cluster,
913	   bypassing the ILA routers, the ILA domains may exchange the mapping
914	   information, and program the ILA mappings in ILA hosts to facilitate
915	   direct paths.

917	   Since each domain may re-use the 64-bit identifier space on its own,
918	   the use of SIR prefix is required to make the identifiers globally
919	   unique.  This requirement is easily fulfilled since the SIR prefix is
920	   required to be globally routable in the Internet.

922	   To enable ILA domain federation, the BGP route-reflectors in each
923	   domain need to be fully meshed and configured to use the "VPN-
924	   ILA" SAFI with "ILA AFI" (see [I-D.lapukhov-bgp-ila-afi]).  This will
925	   propagate the mappings known to each route-reflector scoped with the
926	   SIR prefix of the local domain.  If multiple domains are federated in
927	   this way, intermediate route-reflectors could be used, and filtering
928	   techniques such as described in [RFC5291] and [RFC4684] could be
929	   employed.  The filtering may be further used to allow leaking of only
930	   select mappings, e.g. for the identifiers or tenants that carry lots
931	   of traffic.

933	   If the "push" distribution model is chosen with ILA domain
934	   federation, the ILA hosts will need to be configured to use the
935	   "VPN-ILA" SAFI on their peering sessions with the BGP route
936	   reflectors.  The ILA mapping entry lookups then need to be keyed on
937	   both the SIR prefix and the identifier to be resolved.  Given the
938	   large volume of mappings that may exist in the federated model, the
939	   "pull" model might become preferable.
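
   A lookup keyed on both the SIR prefix and the identifier could look
   like the sketch below.  The class and helper names are hypothetical,
   and the table layout is only one possible choice; the point is that
   the same identifier value may legitimately appear under two
   different SIR prefixes.

```python
import ipaddress

MASK64 = (1 << 64) - 1


def upper64(prefix):
    """Upper 64 bits of a /64 prefix given as a string."""
    return int(ipaddress.IPv6Network(prefix).network_address) >> 64


class FederatedMappingTable:
    def __init__(self):
        self._table = {}  # (SIR prefix bits, identifier) -> locator bits

    def add(self, sir_prefix, identifier, locator):
        self._table[(upper64(sir_prefix), identifier)] = upper64(locator)

    def resolve(self, dst):
        """Return the rewritten destination, or None if no mapping exists."""
        value = int(ipaddress.IPv6Address(dst))
        identifier = value & MASK64
        locator = self._table.get((value >> 64, identifier))
        if locator is None:
            return None
        return str(ipaddress.IPv6Address((locator << 64) | identifier))
```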

941	11.  Operational Considerations

943	   ILA introduces an additional step in packet routing and thus adds
944	   more complexity to the network troubleshooting process.  At the same
945	   time, relative to the virtualization techniques that employ
946	   encapsulation and tunneling, ILA makes the underlying physical
947	   network fully visible to the tasks, and thus makes tenant-driven
948	   troubleshooting simpler.  This section discusses some operational
949	   procedures specific to ILA and the additional fault models that are
950	   possible in the presence of ILA.

952	11.1.  Operational procedures for ILA routers

954	   ILA routers may be added to or removed from the network at any time.
955	   Adding a router is commonly needed to scale the capacity of the ILA
956	   router group when peak load increases.  Adding an ILA router is a
957	   non-disruptive procedure.  It starts by configuring the ILA router
958	   to peer with the BGP mesh to learn of all mappings in the domain.
959	   The use of BGP graceful restart (see [RFC4724]) would allow the new
960	   router to learn when all mappings have been advertised.  At this
961	   time, the router may inject the SIR prefix, join the operational
962	   group of ILA routers, and start forwarding ILA traffic.

964	   To gracefully take the ILA router out of service, it may be
965	   instructed to stop announcing the SIR prefix, or, in case of BGP,
966	   announce it with less preferable path attributes.  This will allow
967	   the router to still accept and forward all in-flight packets, but
968	   will steer subsequent packets toward the remaining ILA routers.

970	11.2.  ICMPv6 Message generation by transit devices

972	   Under some conditions, transit ILA-unaware devices may need to
973	   generate ICMPv6 messages, e.g. when the IPv6 hop limit is exceeded.
974	   The source of the packet sent by an ILA application would have SIR as
975	   the prefix, and hence the ICMPv6 message will need to transit an ILA
976	   router before getting back to the host that sent the original packet.
977	   This has some operational downside, as it adds path stretch to the
978	   control message flow, and needs to be accounted for in operational
979	   procedures.

981	   When an ICMPv6 message generated by an intermediate device arrives
982	   back at the sender of the original packet, the ILA host may need to
983	   translate the payload of the ICMPv6 message, as it often contains the
984	   IPv6 header of the original packet.  This is needed so that the
985	   control message could be properly correlated to transport level
986	   connection.  Thus, it is expected that the ILA host stack will be
987	   able to perform this translation, and replace the ILA locator with
988	   SIR prefix in the destination address field of the encapsulated IPv6
989	   header.
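
   The translation of the embedded header can be illustrated as
   follows.  The sketch operates on the invoking packet carried inside
   the ICMPv6 error message and relies on the fixed IPv6 header layout,
   in which the destination address occupies octets 24 through 39.  The
   function name is hypothetical, and the sketch does not verify that
   the current upper 64 bits are in fact a known locator.

```python
import ipaddress


def restore_sir_prefix(invoking_packet, sir_prefix):
    """Replace the locator in the embedded destination with the SIR prefix.

    invoking_packet: bytes starting at the embedded IPv6 header.
    """
    if len(invoking_packet) < 40:
        raise ValueError("embedded IPv6 header is truncated")
    sir = ipaddress.IPv6Network(sir_prefix).network_address.packed[:8]
    # Octets 24..31 hold the upper 64 bits of the destination address.
    return invoking_packet[:24] + sir + invoking_packet[32:]
```

   After the rewrite, the embedded destination again carries the SIR
   address used by the transport connection, so the error can be matched
   to the right socket.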

991	   The last case is generation of an ICMPv6 message by a transit device
992	   for a packet sourced by a non-ILA host (or outside of the local ILA
993	   domain) and translated by an ILA router.  In this case, the response
994	   will be directed back to the non-ILA host, bypassing the ILA router,
995	   and there will be no easy way to translate the locator portion of the
996	   ILA destination address back to the SIR prefix.  The non-ILA sender
997	   would be able to process the ICMPv6 message.

999	11.3.  Multicast routing

1001	   Defining multicast routing and group membership dissemination is
1002	   outside the scope of this document.

1004	11.4.  Potential ILA mapping table complications

1006	   Every packet egressing from an ILA host and matching the SIR prefix
1007	   is subject to lookup and translation in the local ILA mapping table.
1008	   If an entry is not found, the packet is forwarded to the ILA
1009	   router(s) by virtue of the SIR prefix injected in the datacenter
1010	   network.  If the ILA router does not have the mapping, either the
1011	   ICMPv6 "Destination Unreachable" or "ILA mapping not found" message
1012	   will be sent back, depending on whether the original sender is an ILA
1013	   or non-ILA host.  There are a few observations to make here:

1015	   o  Packets egressing the ILA host and not matching the SIR prefix are
1016	      routed as usual.

1018	   o  ILA destinations that are not yet present in the ILA mapping table
1019	      will be initially routed toward the ILA routers (e.g. the ILA
1020	      routers will show up in the initial "traceroute" command output).

1022	   o  In case of missing identifier mapping, it's the ILA router that
1023	      informs the sender of this event via either an "ILA Mapping not
1024	      Found" or ICMPv6 "Destination Unreachable" messages.

1026	   Thus, the case of a missing mapping is easily debuggable, though the
1027	   "transition period" when the mapping is not yet in the ILA mapping
1028	   table might confuse the operator using the "traceroute" command.

1030	   The most difficult case of ILA mapping table malfunction would be
1031	   the presence of an incorrect mapping, i.e. a mapping pointing to a
1032	   non-existent or incorrect locator.

1034	   o  Non-existent locator.  This will route the packet through the
1035	      network, and eventually result either in packet getting discarded
1036	      due to missing route or IPv6 NDP entry, or packet dropped due to
1037	      routing loop and hop-limit expiration.  In either case, the
1038	      original sender may detect this condition either via reception of
1039	      ICMPv6 "Destination Unreachable" messages, or by observing the
1040	      output of the "traceroute" command.  The ILA host may also be
1041	      configured to make sure the identifiers fall within the known
1042	      prefix range.

1044	   o  Incorrect locator.  In this case, the packet will be delivered to
1045	      the wrong ILA host, that does not have the mapping for the
1046	      identifier.  Depending on whether the sending of ILA redirect
1047	      messages is enabled on the host, two scenarios are possible:

1049	      *  The destination ILA host sends back an ILA redirect message
1050	         with empty locator, informing the sender that mapping is
1051	         invalid.  The sender will invalidate the ILA mapping entry and
1052	         switch over to forwarding via the ILA routers.  The latter will
1053	         either inform it of the new mapping, or send an ICMPv6
1054	         "Destination Unreachable" message back.

1056	      *  The destination ILA host is not configured to send the ILA
1057	         redirect messages back.  In this case, it simply responds with
1058	         the ICMPv6 "Destination Unreachable" messages for the duration
1059	         of time the sender keeps sending the packets using the
1060	         incorrect mapping.  The mapping needs to be flushed or updated
1061	         by some external means.

1063	   The next possible failure is dropped ILA redirect messages.
1064	   However, given that the ILA redirect message sending process is
1065	   memoryless, the recipient will eventually receive one of them, or at
1066	   least finish the communication via an ILA router.

1068	11.5.  Potential ILA routers complications

1070	   The ILA routers serve as proxies for traffic entering the ILA domain,
1071	   as well as temporary transit hops for traffic between the ILA hosts
1072	   when they don't have matching mappings, if the "pull"
1073	   distribution model is utilized.  The following operational
1074	   observations apply:

1076	   o  Traffic between the ILA domain and external world will necessarily
1077	      flow asymmetrically.  The packets toward the ILA hosts sent from
1078	      the outside will always cross the ILA routers (see Section 10 for
1079	      exceptions from this case) and traffic returning from the ILA
1080	      hosts to the external world will flow directly, bypassing the ILA
1081	      routers.  This will show up in the outputs of the "traceroute"
1082	      command running from sender and destination and showing asymmetric
1083	      paths.  This being said, asymmetric traffic flows are very common
1084	      in modern networks, and thus should not be a problem on its own.

1086	   o  A failure of an ILA router should be handled by re-balancing the
1087	      load automatically by means of ECMP re-hashing in the network, and
1088	      therefore should be mostly transparent to the ILA hosts, unless
1089	      the load increases significantly after the failure.  It is
1090	      possible to have a cascading failure and lose all ILA routers, or
1091	      have them over-utilized.  This event should be detected by an
1092	      external monitoring system, and be acted upon by adding more ILA
1093	      routers to the domain - either automatically or manually.  From a
1094	      troubleshooting perspective, the event will manifest itself via
1095	      massive packet loss toward all hosts in the ILA domain.

1097	   o  A malfunction of a single ILA router (e.g. network interface card
1098	      issue) would manifest itself in somewhat increased packet drop
1099	      ratios for flows crossing the ILA routers, mostly traffic from
1100	      external nodes.  The more ILA routers the domain has, the harder
1101	      to notice this ratio would be, since ECMP mostly spreads traffic
1102	      evenly over all the ILA routers.  This problem is more specific to
1103	      ECMP behavior, and tooling exists to deal with it in datacenter
1104	      networks.

1106	   o  ILA routers are in the path of the ICMPv6 messages generated by
1107	      non-ILA-aware routers in the network.  Thus, a loss of such a packet
1108	      in the network could not be differentiated from the loss due to a
1109	      drop by an ILA router.  This may potentially complicate network
1110	      troubleshooting efforts.

1112	   To sum up, the health of the ILA routers is critical to the ILA
1113	   domain functions, even if the "push" model is employed and the ILA
1114	   routers are used mostly for external communications.  The ILA routers
1115	   should be monitored closely for vital parameters, such as CPU and
1116	   memory utilization, traffic rates on their network interfaces, and
1117	   packet loss toward the ILA routers themselves.
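
   As a rough illustration of the failure modes discussed in this
   section, the sketch below assumes ideal ECMP that spreads traffic
   perfectly evenly over all ILA routers.  Real hashing is only
   approximately even, and the function names are hypothetical helpers
   introduced for this example, not part of any ILA implementation.

```python
def load_factor_after_failure(total_routers, failed_routers):
    """Relative per-router load once ECMP re-hashes over the survivors.

    Assumes perfectly even ECMP spreading (an idealization).
    """
    survivors = total_routers - failed_routers
    if total_routers <= 0 or failed_routers < 0 or survivors <= 0:
        raise ValueError("at least one ILA router must survive")
    return total_routers / survivors


def observed_loss_ratio(total_routers, faulty_drop_ratio):
    """Domain-wide loss ratio when one router drops part of its traffic.

    With even spreading, the faulty router carries 1/N of the flows, so
    its drops are diluted by a factor of N.
    """
    if total_routers <= 0 or not 0.0 <= faulty_drop_ratio <= 1.0:
        raise ValueError("invalid router count or drop ratio")
    return faulty_drop_ratio / total_routers
```

   Under these assumptions, losing one of four ILA routers raises the
   load on each survivor by one third, and a single router dropping 20%
   of its packets shows up as roughly 5% loss with four routers but
   only about 1% with twenty, which is why the problem gets harder to
   notice as the domain grows.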

1119	12.  Deployment Scenario Primer

1121	   Building upon the concepts presented above, this section provides a
1122	   simple ILA deployment scenario.

1124	   o  For locator addressing, unique-local addresses are used, with
1125	      16 bits available for sub-allocation.  This allows for 1024
1126	      (2^10) Tier-3 switches with 64 (2^6) servers under each Tier-3
1127	      switch.  Using the Clos topology from Section 4.1, one can build
1128	      32 clusters with 32 Tier-3 switches each.

1130	   o  The hosts in the network would use BGP to peer with Tier-3
1131	      switches and inject their locator prefixes.  It is desirable, but
1132	      not necessary, to configure route summarization on the network
1133	      switches, depending on the size of the deployment.

1135	   o  Given the small to moderate scale of deployment, four IBGP
1136	      route-reflectors would be deployed in the ILA domain, with no
1137	      extra level of aggregation hierarchy.  Each route-reflector must
1138	      be configured to accept BGP sessions from all ILA hosts and be
1139	      able to maintain thousands of peering sessions.

1141	   o  The ILA hosts and routers should be configured with a single SIR
1142	      prefix and set up for the "push" mapping distribution model by
1143	      disabling the sending of ILA redirect messages.  All ILA mappings
1144	      will be propagated to all hosts and ILA routers via BGP.  Each
1145	      ILA host and router will need to run a BGP process and peer with
1146	      all four route-reflectors.

1148	   o  The ILA routers will inject the SIR prefix using BGP into the
1149	      network.

1151	   o  For tasks running on ILA hosts, globally unique ILA identifiers
1152	      should be allocated independently, in a pseudo-random fashion,
1153	      by the host that first starts the task.

1155	   o  When a task is moved, the task scheduler will update the mapping
1156	      and publish it via BGP, forcing the ILA routers and ILA hosts to
1157	      update their ILA mapping tables.

1159	   o  ILA domain federation is not used, so ILA domains communicate
1160	      with each other via the ILA routers only.
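
   The locator addressing arithmetic in the first item of this list can
   be sketched as follows.  This is a minimal illustration, assuming
   that the unique-local prefix is a /48, that the 16 sub-allocation
   bits are split into 10 switch bits followed by 6 server bits, and
   that each server receives a /64 locator prefix.  The prefix
   fd00:1234:5678::/48 and all function names are placeholders chosen
   for this example.

```python
import ipaddress

SWITCH_BITS = 10  # 2^10 = 1024 Tier-3 switches
SERVER_BITS = 6   # 2^6 = 64 servers under each Tier-3 switch


def _base(ula_prefix):
    """Integer value of the /48 unique-local prefix (assumed length)."""
    net = ipaddress.IPv6Network(ula_prefix)
    if net.prefixlen != 48:
        raise ValueError("this sketch assumes a /48 unique-local prefix")
    return int(net.network_address)


def switch_prefix(ula_prefix, switch_id):
    """Summary prefix (/58) covering every server under one switch."""
    if not 0 <= switch_id < 2 ** SWITCH_BITS:
        raise ValueError("switch_id out of range")
    subnet_id = switch_id << SERVER_BITS
    address = _base(ula_prefix) | (subnet_id << 64)
    return ipaddress.IPv6Network((address, 48 + SWITCH_BITS))


def locator_prefix(ula_prefix, switch_id, server_id):
    """Locator prefix (/64) assigned to a single server."""
    if not 0 <= switch_id < 2 ** SWITCH_BITS:
        raise ValueError("switch_id out of range")
    if not 0 <= server_id < 2 ** SERVER_BITS:
        raise ValueError("server_id out of range")
    subnet_id = (switch_id << SERVER_BITS) | server_id
    address = _base(ula_prefix) | (subnet_id << 64)
    return ipaddress.IPv6Network((address, 64))
```

   With this split, server 2 under switch 1 would get
   fd00:1234:5678:42::/64, and the switch could announce the summary
   fd00:1234:5678:40::/58 if route summarization is configured.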

1162	13.  IANA Considerations

1164	   None

1166	14.  Manageability Considerations

1168	   ILA requires both one-time deployment effort and recurring
1169	   management work.  The initial involvement is reasonably high, as it
1170	   requires extending the existing network and host configuration.  It
1171	   does not require any significant changes to the existing
1172	   applications, though, aside from making the applications use newly
1173	   allocated IPv6 addresses.  The majority of the required changes can
1174	   be made without any disruption to the existing infrastructure.

1176	   ILA address management schemes could be arbitrarily complex, but in
1177	   their most basic form they do not require any centralized
1178	   coordination.  Thus, in many cases identifier allocation could be a
1179	   simple local subroutine that generates a pseudo-random identifier.
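
   Such a local subroutine might look like the sketch below.  It treats
   the identifier as an opaque 64-bit value and ignores any type or
   checksum bits that the ILA specification may reserve, so it is an
   illustration of the idea rather than a conforming allocator.  The
   function names and the documentation prefix 2001:db8:0:1::/64 used
   as a stand-in SIR prefix are assumptions made for this example.

```python
import ipaddress
import secrets


def new_identifier():
    """Draw a 64-bit identifier from the OS random source."""
    return secrets.randbits(64)


def sir_address(sir_prefix, identifier):
    """Combine a /64 SIR prefix with a 64-bit identifier."""
    net = ipaddress.IPv6Network(sir_prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch assumes a /64 SIR prefix")
    if not 0 <= identifier < 2 ** 64:
        raise ValueError("identifier must fit in 64 bits")
    return ipaddress.IPv6Address(int(net.network_address) | identifier)
```

   If all 64 bits were random, the chance of any collision among n
   identifiers would be roughly n^2 / 2^65, which is why no central
   coordination is needed at small to moderate scale.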

1181	   Recurring management efforts are mostly concentrated on monitoring
1182	   the components of the ILA deployment, primarily the ILA routers and
1183	   the BGP route reflectors.  Troubleshooting these components follows
1184	   the standard process and uses regular tooling, with the caveat of
1185	   having more logical components to deal with, primarily the ILA
1186	   routers and the ILA mapping tables on the ILA hosts.  This increases
1187	   the complexity of the troubleshooting process, as more state needs
1188	   to be inspected and validated.

1190	15.  Security Considerations

1192	   ILA introduces new security considerations described below.

1194	15.1.  ILA Host Security

1196	   If unsecured ILA redirect messages are used, the ILA hosts could be
1197	   exposed to cache poisoning attacks.  This calls for ILA redirect
1198	   message authentication, e.g., by use of digital signatures such as
1199	   [ED25519].  This also requires a mechanism for propagating the
1200	   public keys associated with the SIR prefix (the ILA routers) and
1201	   with every locator in the domain, since the ILA redirect message
1202	   could be sent by either.

1204	   To prevent tasks from ever being able to send packets directly,
1205	   bypassing the mapping layer, the ILA hosts should prohibit tasks
1206	   from sending packets toward the address space associated with the
1207	   locators.  Given that all locators will likely belong to one large
1208	   prefix, this could be accomplished by installing a single filtering
1209	   rule on the ILA host.
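
   The logic of that single rule can be sketched as a destination
   check.  In practice the rule would live in the host's packet filter
   rather than in application code; the function name and the example
   prefixes are hypothetical.

```python
import ipaddress


def task_may_send(destination, locator_space):
    """Return False when the destination lies inside the locator space.

    locator_space is the one large prefix that holds all locators
    (an assumption taken from the text above).
    """
    dst = ipaddress.ip_address(destination)
    return dst not in ipaddress.ip_network(locator_space)
```

   For example, with fd00:1234:5678::/48 as the locator space, a task
   packet addressed to fd00:1234:5678:42::1 would be rejected, while a
   packet addressed to a SIR address outside that prefix would pass.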

1211	15.2.  BGP Security

1213	   Standard means of improving BGP security as described in [RFC7454]
1214	   could be applied to harden the mapping dissemination system.  Among
1215	   them, the most important one is likely to be the "TCP Authentication
1216	   Option" described in the referenced document.  Note that the BGP
1217	   subsystem used to distribute the ILA mappings is not as vulnerable
1218	   as the Internet BGP mesh, since it only works within the boundaries
1219	   of a privately managed datacenter.

1221	15.3.  ILA Router Security

1223	   ILA routers are primarily susceptible to various forms of rate-based
1224	   DDoS attacks.  The primary concern is overrunning the capabilities
1225	   of ILA routers with too many packets sent from non-ILA hosts toward
1226	   the SIR addresses, or a "thundering herd" problem when ILA
1227	   translation tables on the ILA hosts expire synchronously or due to
1228	   a poisoning attack.  The main mitigations are closely monitoring
1229	   server utilization and potentially rate-limiting packet flow to the
1230	   ILA router on the upstream network device (ToR switch).
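
   Rate-limiting of this kind is commonly modeled as a token bucket.
   The sketch below is a generic illustration of that technique, not a
   description of any particular switch feature; the class name and
   parameters are assumptions, and time is passed in explicitly to keep
   the example deterministic.

```python
class TokenBucket:
    """Generic token-bucket limiter: rate_pps tokens/s, up to burst."""

    def __init__(self, rate_pps, burst):
        self.rate = float(rate_pps)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous decision

    def allow(self, now):
        """Return True if a packet arriving at time now may pass."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

   A bucket created as TokenBucket(rate_pps=10, burst=5) passes a burst
   of five packets, drops further packets arriving at the same instant,
   and admits traffic again as tokens refill.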

1232	15.4.  Tenant Security

1234	   ILA does not natively isolate tenants' traffic from each other, nor
1235	   from the underlying physical infrastructure.  In fact, this is seen
1236	   as one benefit that makes many troubleshooting processes easier.
1237	   Access control then becomes the responsibility of the tenant itself,
1238	   by employing traffic filtering rules.  To this point, implementing
1239	   filtering rules gets simpler if the tenant is allocated a single
1240	   prefix, as opposed to each task getting a unique identifier.

1242	16.  Acknowledgements

1244	   TBD

1246	17.  Informative References

1248	   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
1249	              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
1250	              DOI 10.17487/RFC4271, January 2006,
1251	              <http://www.rfc-editor.org/info/rfc4271>.

1253	   [RFC4456]  Bates, T., Chen, E., and R. Chandra, "BGP Route
1254	              Reflection: An Alternative to Full Mesh Internal BGP
1255	              (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006,
1256	              <http://www.rfc-editor.org/info/rfc4456>.

1258	   [RFC4684]  Marques, P., Bonica, R., Fang, L., Martini, L., Raszuk,
1259	              R., Patel, K., and J. Guichard, "Constrained Route
1260	              Distribution for Border Gateway Protocol/MultiProtocol
1261	              Label Switching (BGP/MPLS) Internet Protocol (IP) Virtual
1262	              Private Networks (VPNs)", RFC 4684, DOI 10.17487/RFC4684,
1263	              November 2006, <http://www.rfc-editor.org/info/rfc4684>.

1265	   [RFC5291]  Chen, E. and Y. Rekhter, "Outbound Route Filtering
1266	              Capability for BGP-4", RFC 5291, DOI 10.17487/RFC5291,
1267	              August 2008, <http://www.rfc-editor.org/info/rfc5291>.

1269	   [RFC6740]  Atkinson, RJ. and SN. Bhatti, "Identifier-Locator Network
1270	              Protocol (ILNP) Architectural Description", RFC 6740,
1271	              DOI 10.17487/RFC6740, November 2012,
1272	              <http://www.rfc-editor.org/info/rfc6740>.

1274	   [RFC2791]  Yu, J., "Scalable Routing Design Principles", RFC 2791,
1275	              DOI 10.17487/RFC2791, July 2000,
1276	              <http://www.rfc-editor.org/info/rfc2791>.

1278	   [RFC3633]  Troan, O. and R. Droms, "IPv6 Prefix Options for Dynamic
1279	              Host Configuration Protocol (DHCP) version 6", RFC 3633,
1280	              DOI 10.17487/RFC3633, December 2003,
1281	              <http://www.rfc-editor.org/info/rfc3633>.

1283	   [RFC4724]  Sangli, S., Chen, E., Fernando, R., Scudder, J., and Y.
1284	              Rekhter, "Graceful Restart Mechanism for BGP", RFC 4724,
1285	              DOI 10.17487/RFC4724, January 2007,
1286	              <http://www.rfc-editor.org/info/rfc4724>.

1288	   [RFC4760]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
1289	              "Multiprotocol Extensions for BGP-4", RFC 4760,
1290	              DOI 10.17487/RFC4760, January 2007,
1291	              <http://www.rfc-editor.org/info/rfc4760>.

1293	   [RFC4786]  Abley, J. and K. Lindqvist, "Operation of Anycast
1294	              Services", BCP 126, RFC 4786, DOI 10.17487/RFC4786,
1295	              December 2006, <http://www.rfc-editor.org/info/rfc4786>.

1297	   [RFC6769]  Raszuk, R., Heitz, J., Lo, A., Zhang, L., and X. Xu,
1298	              "Simple Virtual Aggregation (S-VA)", RFC 6769,
1299	              DOI 10.17487/RFC6769, October 2012,
1300	              <http://www.rfc-editor.org/info/rfc6769>.

1302	   [RFC6830]  Farinacci, D., Fuller, V., Meyer, D., and D. Lewis, "The
1303	              Locator/ID Separation Protocol (LISP)", RFC 6830,
1304	              DOI 10.17487/RFC6830, January 2013,
1305	              <http://www.rfc-editor.org/info/rfc6830>.

1307	   [RFC7454]  Durand, J., Pepelnjak, I., and G. Doering, "BGP Operations
1308	              and Security", BCP 194, RFC 7454, DOI 10.17487/RFC7454,
1309	              February 2015, <http://www.rfc-editor.org/info/rfc7454>.

1311	   [RFC7695]  Pfister, P., Paterson, B., and J. Arkko, "Distributed
1312	              Prefix Assignment Algorithm", RFC 7695,
1313	              DOI 10.17487/RFC7695, November 2015,
1314	              <http://www.rfc-editor.org/info/rfc7695>.

1316	   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
1317	              BGP for Routing in Large-Scale Data Centers", RFC 7938,
1318	              DOI 10.17487/RFC7938, August 2016,
1319	              <http://www.rfc-editor.org/info/rfc7938>.

1321	   [I-D.herbert-nvo3-ila]
1322	              Herbert, T., "Identifier-locator addressing for IPv6",
1323	              draft-herbert-nvo3-ila-03 (work in progress), October
1324	              2016.

1326	   [I-D.lapukhov-bgp-opaque-signaling]
1327	              Lapukhov, P., Aries, E., Marques, P., and E. Nkposong,
1328	              "Use of BGP for Opaque Signaling", draft-lapukhov-bgp-
1329	              opaque-signaling-02 (work in progress), April 2016.

1331	   [I-D.ietf-v6ops-dc-ipv6]
1332	              Lopez, D., Chen, Z., Tsou, T., Zhou, C., and A. Servin,
1333	              "IPv6 Operational Guidelines for Datacenters", draft-ietf-
1334	              v6ops-dc-ipv6-01 (work in progress), February 2014.

1336	   [I-D.lapukhov-bgp-ila-afi]
1337	              Lapukhov, P., "Use of BGP for dissemination of ILA mapping
1338	              information", draft-lapukhov-bgp-ila-afi-01 (work in
1339	              progress), March 2016.

1341	   [I-D.ietf-grow-bmp]
1342	              Scudder, J., Fernando, R., and S. Stuart, "BGP Monitoring
1343	              Protocol", draft-ietf-grow-bmp-17 (work in progress),
1344	              January 2016.

1346	   [I-D.ietf-nvo3-arch]
1347	              Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T.
1348	              Narten, "An Architecture for Data Center Network
1349	              Virtualization Overlays (NVO3)", draft-ietf-nvo3-arch-08
1350	              (work in progress), September 2016.

1352	   [ED25519]  "Ed25519: high-speed high-security signatures",
1353	              <https://ed25519.cr.yp.to>.

1355	   [ETCD]     "coreos/etcd", <https://github.com/coreos/etcd>.

1357	   [MEMCACHED]
1358	              "Memcached", <https://memcached.org/>.

1360	   [ROUTED-DESIGN]
1361	              "High Availability Campus Network Design", 2008,
1362	              <http://www.cisco.com/c/en/us/td/docs/solutions/
1363	              Enterprise/Campus/routed-ex.html>.

1365	   [LINUX-NAMESPACES]
1366	              "Namespaces in operation, part 1: namespaces overview",
1367	              2013, <https://lwn.net/Articles/531114/>.

1369	   [IPVLAN]   "IPVLAN Driver HOWTO", 2013,
1370	              <https://github.com/torvalds/linux/blob/master/
1371	              Documentation/networking/ipvlan.txt>.

1373	Author's Address
1374	   Petr Lapukhov
1375	   Facebook
1376	   1 Hacker Way
1377	   Menlo Park, CA  94025
1378	   US

1380	   Email: petr@fb.com