idnits 2.17.1 draft-dunbar-nvo3-nva-mapping-distribution-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 321: '...V is corrupt and MUST be discarded. Th...' RFC 2119 keyword, line 433: '...served bits that MUST be sent as zero ...' RFC 2119 keyword, line 588: '...Err, and SubErr: MUST be sent as zero ...' RFC 2119 keyword, line 622: '...uent QUERY records MUST be ignored and...' RFC 2119 keyword, line 623: '...e entire Query message MAY be ignored....' (19 more instances...) Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 59 has weird spacing: '...chanism for N...' -- The document date (July 8, 2016) is 2842 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC5304' is mentioned on line 1016, but not defined == Missing Reference: 'RFC5310' is mentioned on line 1016, but not defined == Unused Reference: 'RFC4971' is defined on line 1031, but no explicit reference was found in the text == Unused Reference: 'RFC826' is defined on line 1053, but no explicit reference was found in the text == Unused Reference: 'RFC4861' is defined on line 1056, but no explicit reference was found in the text ** Obsolete normative reference: RFC 4971 (Obsoleted by RFC 7981) Summary: 2 errors (**), 0 flaws (~~), 7 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 NV03 working group L. Dunbar 2 Internet Draft D. Eastlake 3 Category: Standards Track Huawei 4 Expires: November 2017 Tom Herbert 5 Google 7 July 8, 2016 9 NVA Address Mapping Distribution (NAMD) Protocol 11 draft-dunbar-nvo3-nva-mapping-distribution-03.txt 13 Status of this Memo 15 This Internet-Draft is submitted in full conformance with the 16 provisions of BCP 78 and BCP 79. This document may not be 17 modified, and derivative works of it may not be created, except 18 to publish it as an RFC and to translate it into languages other 19 than English. 21 Internet-Drafts are working documents of the Internet Engineering 22 Task Force (IETF), its areas, and its working groups. Note that 23 other groups may also distribute working documents as Internet- 24 Drafts. 26 Internet-Drafts are draft documents valid for a maximum of six 27 months and may be updated, replaced, or obsoleted by other 28 documents at any time. 
It is inappropriate to use Internet- 29 Drafts as reference material or to cite them other than as "work 30 in progress." 32 The list of current Internet-Drafts can be accessed at 33 http://www.ietf.org/ietf/1id-abstracts.txt 35 The list of Internet-Draft Shadow Directories can be accessed at 36 http://www.ietf.org/shadow.html 38 This Internet-Draft will expire on April 8, 2015. 40 Internet-Draft NVA mapping distribution 42 Copyright Notice 44 Copyright (c) 2015 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents 49 (http://trustee.ietf.org/license-info) in effect on the date of 50 publication of this document. Please review these documents 51 carefully, as they describe your rights and restrictions with 52 respect to this document. Code Components extracted from this 53 document must include Simplified BSD License text as described in 54 Section 4.e of the Trust Legal Provisions and are provided 55 without warranty as described in the Simplified BSD License. 57 Abstract 59 This draft describes the mechanism for NVA to promptly and 60 incrementally distribute the inner (TS) to outer (NVE) mapping 61 and VN Context to relevant NVEs in a timely manner. 63 Table of Contents 65 1. Introduction...................................................4 66 2. Terminology....................................................4 67 3. Overall Requirement for NVE<->NVA Control Plane................5 68 4. Terminologies and Assumptions..................................6 69 5. Overview of NVA Address Mapping Distribution (NAMD) Protocol...7 70 6. TLV for NVE reachable addresses................................7 71 7. Push Mechanism.................................................8 72 7.1. Requesting Push Service...................................9 73 7.2. Incremental Push Service.................................12 74 8. 
Pull Mechanism................................................13 75 8.1. Pull Query Format........................................14 76 8.2. Pull Response............................................16 77 8.3. Cache Consistency........................................19 78 8.4. Update Message Format....................................20 79 8.5. Acknowledge Message Format...............................21 80 8.6. Pull Request Errors......................................21 81 8.7. Redundant Pull NVAs......................................21 82 9. Hybrid Mode...................................................21 83 10. Redundancy...................................................22 84 11. Inconsistency Processing.....................................22 85 12. Protocols to consider to carry NAMD messages.................23 87 Internet-Draft NVA mapping distribution 89 13. Security Considerations......................................23 90 14. IANA Considerations..........................................24 91 15. Acknowledgements.............................................24 92 16. References...................................................24 93 16.1. Normative References....................................24 94 16.2. Informative References..................................24 95 Authors' Addresses...............................................25 97 Internet-Draft NVA mapping distribution 99 1. Introduction 101 Section 4.5 of [nvo3-problem-statement] describes the back-end 102 Network Virtualization Authority (NVA) that is responsible for 103 distributing the mapping information for entire overlay system. 104 [nvo3-nve-nva-cp-req] defines the requirement for the control 105 plane between NVA and NVE. 107 This draft describes a mechanism for NVA to promptly and 108 incrementally distribute the inner (TS) to outer (NVE) mapping 109 and VN Context to relevant NVEs in a timely manner. 
111 For ease of description, the term "NAMD" is used to represent the 112 NVA Address Mapping Distribution protocol. 114 2. Terminology 116 The following terms are used interchangeably in this document: 118 - The terms "Subnet" and "VLAN" because it is common to 119 map one subnet to one VLAN. 120 - The term "Directory" and "Network Virtualization 121 Authority (NVA)" 122 - The term "NVE" and "Edge" 124 Bridge: IEEE Std 802.1Q-2011 compliant device [802.1Q]. In this 125 draft, Bridge is used interchangeably with Layer 2 126 switch. 128 NAMD Timeout: The time interval that an NVE can assume NVA is 129 not reachable if the NVE hasn't received any updates 130 from NVA during this time. NAMD Timeout is an unsigned 131 byte that gives the amount of time in seconds during 132 which the NVA will send at least three update PDUs. An 133 empty update is used as a keep alive. It defaults to 30 134 seconds. 136 DA: Destination Address 138 DC: Data Center 140 EoR: End of Row switches in data center. Also known as 141 aggregation switches. 143 Internet-Draft NVA mapping distribution 145 End Station: Guest OS running on a physical server or on a 146 virtual machine. An end station in this document has at 147 least one IP address and at least one MAC address, which 148 could be in DA or SA field of a data frame. 150 LISP: Locator/ID Separation Protocol 152 NVA: Network Virtualization Authority 154 NVE: Network Virtualization Edge 156 SA: Source Address 158 Station: A node, or a virtual node, with IP and/or MAC addresses, 159 which could be in the DA or SA of a data frame. 161 ToR: Top of Rack Switch in data center. It is also known as 162 access switches in some data centers. 164 TS: Tenant System 166 VM: Virtual Machines 168 VN: Virtual Network 170 VNID: Virtual Network Instance Identifier 172 3. Overall Requirement for NVE<->NVA Control Plane 174 Section 3.1 of [nvo3-cp-req] describes the basic requirement of 175 inner address to outer address mapping for NVO3. 
A NVE needs to 176 know the mapping of the Tenant System destination (inner) address 177 to the (outer) address (IP) on the Underlying Network of the 178 egress NVE. 180 Section 3.1 of [nvo3-cp-req] states that a protocol is needed to 181 provide this inner to outer mapping and VN Context to each NVE 182 that requires it and keep the mapping updated in a timely manner. 183 Timely updates are important for maintaining connectivity between 184 Tenant Systems. 186 Internet-Draft NVA mapping distribution 188 4. Terminologies and Assumptions 190 NVAs can be centralized or distributed with each NVA holding the 191 mapping information for a subset of VNs. By saying that an NVA 192 holds mapping information for a VN, it means that the NVA has 193 mapping information for all the TSs in the VN. 195 Centralized NVA means that the NVA holds mapping information for 196 all the VNs in the administrative domain. There could be multiple 197 instances of centralized NVA for redundancy purpose. 199 A NVA could be instantiated on a server/VM attached to a NVE, 200 very much like a TS attached to a NVE, or could be integrated 201 within an NVE. When a NVA is a standalone server/VM attached to a 202 NVE, it has to be reachable via the attached NVE by other NVEs. A 203 NVA can also be instantiated on a NVE that doesn't have any TSs 204 attached. The NVE-NVA control plane for NVA being attached to NVE 205 (like a VM) will require additional functions on NVEs than NVA 206 being embedded in a NVE. 208 NVA should have at least the following information for each TS: 209 . Inner Address: TS (host) Address family (IPv4/IPv6, MAC, 210 virtual network Identifier MPLS/VLAN, etc) 212 . Outer Address: The list of locally attached edges (NVEs); 213 normally one TS is attached to one edge, TS could also be 214 attached to 2 edges for redundancy (dual homing). One TS is 215 rarely attached to more than 2 edges, though it could be 216 possible; 218 . VN Context (VN ID and/or VN Name) 220 . 
Timer for NVEs to keep the entry when pushed down to or 221 pulled from NVEs. 223 . Optionally the list of interested remote edges (NVEs). This 224 information is for NVA to promptly update relevant edges 225 (NVEs) when there is any change to this TS' attachment to 226 edges (NVEs). However, this information doesn't have to be 227 kept per TS. It can be kept per VN. 229 By saying that an NVE is participating in a VN or the VN is active 230 on the NVE, it means that the VN is enabled on the NVE and there 231 is at least one TS of the VN being attached to the NVE. 235 5. Overview of NVA Address Mapping Distribution (NAMD) Protocol 237 The inner-outer address mapping could change as TSs move from one NVE 238 to another. At any given period, probably only a small set of TSs 239 would move, resulting in a small portion of changes on the inner- 240 outer address mapping. Therefore, it is important to have a 241 mechanism for NVA to send incremental updates to NVEs for the 242 changes instead of the entire database of the mapping entries. This 243 document specifies the incremental update messages (TLVs) from 244 NVAs to NVEs, to maintain data consistency between NVAs and NVEs. 246 The NAMD mechanism requires messages to distribute NVA content to 247 all the NVEs, notify the relevant NVEs of incremental changes, 248 and maintain the database consistency between NVA and NVEs. 249 This document specifies the structures (a.k.a. TLVs) of those 250 messages, which are referred to as NAMD messages throughout this 251 document. The NAMD TLVs can be included in BGP or IGP protocol 252 messages. How they are integrated with BGP or the IGP will be 253 further specified in the corresponding working groups. 255 An NVA can offer services in a Push model, a Pull model, or a 256 combination of the two. 258 In the Push model, the NVE, upon restart or initialization, sends 259 requests for all the interested VNs as a multicast to all the 260 NVAs.
NVAs with the requested VNs use NAMD messages to distribute 261 the mapping entries to the requesting NVEs. Whenever there are 262 changes in the mapping entries, NVA uses NAMD messages to send 263 only the changed portion of the entries. 265 In the Pull model, an NVA periodically sends VN scoped broadcast 266 messages to all NVEs. An NVE, upon receiving an unknown unicast or 267 ARP/ND with unknown target NVE, sends the pull request to the NVA 268 that supports the VN that the target belongs to. 270 6. TLV for NVE reachable addresses 272 The Reachable Interface Addresses (IA) TLV is used to advertise a 273 set of addresses within a VN being attached to (or reachable by) 274 a specific NVE, and optionally the NVE Virtual Access Point. 276 These addresses can be in different address families. For 277 example, it can be used to declare that a particular interface 278 with specified IPv4, IPv6, and 48-bit MAC addresses in some 279 particular VN is reachable from a particular NVE. 283 This document suggests using the Interface Addresses APPsub-TLV 284 defined by [IA] except using NVE address subTLV in the fourth 285 field shown below: 287 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 288 | Type = TBD | (2 bytes) 289 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 290 | Length | (2 bytes) 291 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 292 | Addr Sets End | (2 bytes) 293 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 294 | NVE Address subTLV ... (variable) 295 +-+-+-+-+-+-+-+-+-+-+-+- 296 | Flags | (1 byte) 297 +-+-+-+-+-+-+-+-+ 298 | Confidence | (1 byte) 299 +-+-+-+-+-+-+-+-+-+- 300 | Template ... (variable) 301 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+ 302 | Address Set 1 (size determined by Template) | 303 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+ 304 | Address Set 2 (size determined by Template) | 305 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+ 306 | ...
307 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+ 308 | Address Set N (size determined by Template) | 309 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+ 310 | optional sub-sub-TLVs ... 311 +-+-+-+-+-+-+-+-+-+-+-+-... 313 Figure 1. The Interface Addresses APPsub-TLV 315 Addr Sets End: The unsigned integer offset of the byte, within 316 the IA APPsub-TLV [IA] value part, of the last byte of the last 317 Address Set. This will be the byte just before the first sub-sub- 318 TLV if any sub-sub-TLVs are present (see Section 3). If this is 319 equal to Length, there are no sub-sub-TLVs. If this is greater 320 than Length or points to before the end of the Template, the IA 321 APPsub-TLV is corrupt and MUST be discarded. This field is always 322 two bytes in size. 324 7. Push Mechanism 326 Under this mode, NVA pushes the inner-outer mapping for all the 327 TSs of the VNs to relevant NVEs. This service is scoped by VN. A 328 Push NVA also advertises whether or not it believes it has pushed 329 complete mapping information for a VN. It might be pushing only a 331 Internet-Draft NVA mapping distribution 333 subset of the mapping and/or reachability information for a VN. 334 The Push Model uses the NAMD messages as its distribution 335 mechanism. 337 With the Push model, if the destination of a data frame arriving 338 at the Ingress NVE can't be found in its inner-outer mapping 339 database that are pushed down from the NVA, the Ingress edge 340 could be configured with one or more of the following policies: 342 - simply drop the data frame, 343 - flood the data frames to other NVEs that have the VN 344 enabled, or 345 - start the "pull" process to get information from Pull 346 NVA. 347 When the NVE is waiting for reply from the Pull 348 process, the NVE can either drop or queue the packet. 350 One drawback of the Push Mode is that it will push more mapping 351 entries to an NVE than needed. 
Under the normal process of edge 352 cache aging and unknown destination address flooding, rarely 353 used entries would have been removed. It would be difficult for 354 NVA to predict the communication patterns from/to TSs within one 355 VN. Therefore, it is likely that the NVA will push down all the 356 entries for all the VNs that are enabled on the NVE. 358 Another drawback with Push model: there really can't be any 359 source-based policy. It's all or nothing. 361 7.1. Requesting Push Service 363 When a NVE is initialized or re-started, it needs to send request 364 to the relevant NVAs to push down the mapping information for the 365 active VNs on the NVE. NVE could use Virtual Network scoped 366 message to announce all the Virtual Networks in which it is 367 participating to NVAs who have the mapping information for the 368 VNs. A new subTLV (Enabled-VN TLV) is specified here for NVE to 369 indicate all its interested VNs in the NAMD message. The new 370 subTLV can be included in an IGP protocol message or BGP message. 372 For 24-bits VN ID, there could be 16 million VNs. Multiple ways 373 can be used to express the interested VNs: 375 - Starting VN & End VNs & bit map for the VNs in between. 376 - Starting VN & End VN (for the VNs that are contiguous) 378 Internet-Draft NVA mapping distribution 380 - Individual VN listing (for a small number of VNs that are not 381 contiguous) 383 Therefore 3 different types of subTLV are specified: 385 +-+-+-+-+-+-+-+-+ 386 |INT-VN-TYPE-1 | (1 byte) 387 +-+-+-+-+-+-+-+-+ 388 | Length | (1 byte) 389 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 390 | Start VN ID | (4 bytes) 391 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 392 | VNID bit-map.... 393 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 394 Figure 2. 
Enabled-VN TLV using bit map 396 +-+-+-+-+-+-+-+-+ 397 | INT-VN-TYPE-2 | (1 byte) 398 +-+-+-+-+-+-+-+-+ 399 | Length | (1 byte) 400 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 401 | Start VN ID | (4 bytes) 402 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 403 | End VN ID | (4 bytes) 404 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 405 Figure 3. Enabled-VN TLV using Range 409 +-+-+-+-+-+-+-+-+ 410 | INT-VN-TYPE-3 | (1 byte) 411 +-+-+-+-+-+-+-+-+ 412 | Length | (1 byte) 413 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 414 | VN ID | (4 bytes) 415 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 416 | VN ID | (4 bytes) 417 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 418 | VN ID | (4 bytes) 419 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 420 | . . . 421 +-+-+-+-+-+-+-+-+-+-+-+-+ 422 Figure 4. Enabled-VN TLV using list 424 - Type: indicates how the VNs in which the NVE is participating 425 are expressed: INT-VN-TYPE-1 uses a bit map to express the 426 interested VNs; INT-VN-TYPE-2 uses a range to express the 427 interested VNs (if the interested VNs are contiguous); 428 INT-VN-TYPE-3 uses an individual VN list to express the interested 429 VNs. 431 - Length: Variable. 433 - RESV: 4 reserved bits that MUST be sent as zero and ignored on 434 receipt. 436 - Start VN ID: The 24-bit VN-ID that is represented by the high- 437 order bit of the first byte of the VN-ID bit-map. 439 - VN-ID bit-map: The highest-order bit indicates the VN equal to 440 the start VN ID, the next highest bit indicates the VN equal to 441 start VN ID + 1, continuing to the end of the VN bit-map field. 443 If this sub-TLV occurs more than once in a Hello, the set of 444 enabled VNs is the union of the sets of VNs indicated by each of 445 the Enabled-VN sub-TLVs in the Hello. 447 When NVA is distributed, there could be multiple NVAs with each 448 hosting mapping information for a subset of VNs.
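The VN-ID bit-map rule given above for INT-VN-TYPE-1 (highest-order bit of the first byte is the Start VN ID, each following bit is the next VN ID) and the union rule for repeated sub-TLVs can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical and not part of NAMD, and the sketch handles only the Start VN ID and bit-map values, not the Type and Length bytes, whose code points are not assigned in this document.

```python
def encode_vn_bitmap(vn_ids):
    """Return (start_vn_id, bitmap_bytes) for a non-empty set of 24-bit VN IDs.

    Hypothetical helper: bit 0x80 of the first byte stands for start_vn_id,
    bit 0x40 for start_vn_id + 1, and so on, as the bit-map text describes.
    """
    vns = sorted(set(vn_ids))
    if not vns:
        raise ValueError("at least one VN ID is required")
    if vns[0] < 0 or vns[-1] > 0xFFFFFF:
        raise ValueError("VN IDs must fit in 24 bits")
    start = vns[0]
    bitmap = bytearray((vns[-1] - start) // 8 + 1)
    for vn in vns:
        offset = vn - start
        bitmap[offset // 8] |= 0x80 >> (offset % 8)
    return start, bytes(bitmap)


def decode_vn_bitmap(start, bitmap):
    """Return the set of VN IDs whose bits are set in the bit-map."""
    vns = set()
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (0x80 >> bit):
                vns.add(start + byte_index * 8 + bit)
    return vns


def union_enabled_vns(sub_tlvs):
    """Union of several (start, bitmap) pairs, as for repeated sub-TLVs."""
    enabled = set()
    for start, bitmap in sub_tlvs:
        enabled |= decode_vn_bitmap(start, bitmap)
    return enabled
```

For example, VN IDs 100, 101, and 109 encode to Start VN ID 100 with a two-byte bit-map 0xC0 0x40. A bit-map is compact only when the interested VNs are close together, which is why the range and list forms are also offered.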
450 Each NVA advertises its availability to push mapping information 451 for a particular virtual network to all NVEs that participate in 452 the VN. NVEs subscribe to the relevant NVAs. 456 The subscription is VN scoped, so that an NVA doesn't need to push 457 down the entire set of mapping entries. Each Push NVA also has a 458 priority. For robustness, the one or two NVAs with the highest 459 priority are considered as Active in pushing information for the 460 VN to all NVEs that have subscribed for that VN. 462 7.2. Incremental Push Service 464 Whenever there is any change in a TS's association to an NVE, which 465 can be triggered by a TS being added, removed, or de-commissioned, 466 an incremental update has to be sent to the NVEs that are 467 impacted by the change. Therefore, proper sequence numbers have 468 to be maintained by NVA and NVEs. The NAMD incremental message 469 is used to update and maintain the database consistency between 470 NVAs and NVEs. We assume that NVA gets notification from an 471 authoritative source, such as a VM management system, when TS-NVE 472 attachment changes occur. 474 A new TLV is needed to carry the NAMD Timeout value and a flag 475 for NVA to indicate it has completed all updates. 477 If the Push NVA is configured to believe it has complete mapping 478 information for VN X then, after it has actually transmitted all 479 of its messages for VN X, it sets the Complete Push (CP) bit to 480 one. It then maintains the CP bit as one as long as it is Active. 482 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 483 | Type | (2 bytes) 484 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 485 | Length | (2 bytes) 486 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 487 |R| Priority | (1 byte) 488 +-+-+-+-+-+-+-+-+ 489 | NAMD Timeout | (1 byte) 490 +-+-+-+-+-+-+-+-+ 491 | Flags | (1 byte) 492 +---------------+ 493 | Reserved for expansion (variable) 494 +-+-+-+-... 495 Figure 3.
NAMD Complete TLV 497 Flags: A byte of flags defined as follows: 501 0 1 2 3 4 5 6 7 502 +---+---+---+---+---+---+---+---+ 503 | UN|CP | RESV | 504 +---+---+---+---+---+---+---+---+ 506 The UN flag indicates that the NVA will accept and properly 507 process NVA PDUs sent by unicast. 509 The CP flag indicates that the NVA has completed its update. 511 8. Pull Mechanism 513 Under this mode, an NVE pulls the mapping entries from the NVA 514 when its cache doesn't have the mapping entries. 516 The main advantage of Pull Mode is that the mapping is stored 517 only where it needs to be stored and only when it is required. In 518 addition, in the Pull Mode, NVEs can age out mapping entries if 519 they haven't been used for a certain period of time. Therefore, 520 each NVE will only keep the entries that are frequently used, so 521 its mapping table size will be smaller than a complete table 522 pushed down from NVA. 524 The drawback of Pull Mode is that it might take some time for 525 NVEs to pull the needed mapping from NVA. Before the NVE gets the 526 response from NVA, the NVE has to buffer the subsequent data 527 frames destined to the same target. The buffer 528 could overflow before the NVE gets the response from NVA. 529 However, this scenario should not happen very often in a data 530 center environment because most likely the TSs are end systems 531 which have to wait for (TCP) acknowledgement before sending 532 subsequent data frames. Another option is to forward, not flood, 533 subsequent frames to a default location, if the NVE is configured 534 with a default node that has the ability to forward data frames 535 when the NVE doesn't have the mapping information. This node can 536 be the gateway, or a re-encapsulating NVE in NAMD context. 538 It is worth noting that the practice of an edge waiting and dropping 539 packets upon receiving an unknown DA is not new.
Most deployed 540 routers today drop packets while waiting for target addresses to 541 be resolved. It is too expensive to queue subsequent packets 542 while resolving target address. The routers send ARP/ND requests 543 to the target upon receiving a packet with DA not in its ARP/ND 544 cache and wait for an ARP or ND responses. This practice 545 minimizes flooding when targets don't exist in the subnet. When 546 the target doesn't exist in the subnet, routers generally re-send 548 Internet-Draft NVA mapping distribution 550 an ARP/ND request a few more times before dropping the packets. 551 The holding time by routers to wait for an ARP/ND response when 552 the target doesn't exist in the subnet can be longer than the 553 time taken by the Pull Mode to get mapping from NVA. 555 8.1. Pull Query Format 557 Here are some events that can trigger the pulling process: 559 o An NVE receives a data frame from the attached TSs with a 560 destination whose attached NVE is unknown, or 561 o The NVE receives an ingress ARP/ND request for a target 562 whose link address (MAC) or attached NVE is unknown. 564 Each Pull request can have queries for multiple inner-outer 565 mapping entries. The message format is defined below: 567 0 1 2 3 568 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 569 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 570 | Ver | Type | Flags | Count | Err | SubErr | 571 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 572 | Sequence Number | 573 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 574 | QUERY 1 575 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-... 576 | QUERY 2 577 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-... 578 | ... 579 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-... 580 | QUERY K 581 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-... 582 Figure 4. Pull Query TLV 584 Type: 1 for Query. Queries received by an NVE that is not a 585 Pull NVA result in an error response unless inhibited by rate 586 limiting. 
588 Flags, Err, and SubErr: MUST be sent as zero and ignored on 589 receipt. 591 Count: Number of QUERY Records present. A Query message Count 592 of zero is explicitly allowed, for the purpose of pinging a 593 Pull NVA server to see if it is responding. On receipt of such 595 Internet-Draft NVA mapping distribution 597 an empty Query message, a Response message that also has a 598 Count of zero is sent unless inhibited by rate limiting. 600 QUERY: Each QUERY Record within a Pull Directory Query message 601 is formatted as follows: 603 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 604 +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ 605 | SIZE | RESV | QTYPE | 606 +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ 607 If QTYPE = 1 608 +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ 609 | AFN | 610 +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ 611 | Query address ... 612 +--+--+--+--+--+--+--+--+--+--+--... 613 If QTYPE = 2, 3, 4, or 5 614 +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ 615 | Query frame ... 616 +--+--+--+--+--+--+--+--+--+--+--... 618 SIZE: Size of the QUERY record in bytes as an unsigned integer 619 starting after the SIZE field and following byte. Thus the 620 minimum legal value is 2. A value of SIZE less than 2 indicates 621 a malformed QUERY record. The QUERY record with the illegal 622 SIZE value and any subsequent QUERY records MUST be ignored and 623 the entire Query message MAY be ignored. 625 RESV: A block of reserved bits. MUST be sent as zero and 626 ignored on receipt. 628 QTYPE: There are several types of QUERY Records currently 629 defined in two classes as follows: (1) a QUERY Record that 630 provides an explicit address and asks for all addresses for the 631 interface specified by the query address and (2) a QUERY Record 632 that includes a frame. The fields of each are specified below. 
Values of QTYPE are as follows: 635 QTYPE Description 636 ----- ----------- 637 0 reserved 638 1 address query 639 2 ARP query frame 640 3 ND query frame 641 4 RARP query frame 642 5 Unknown unicast MAC query frame 643 6-14 assignable by IETF Review 647 15 reserved 649 AFN: Address Family Number of the query address. 651 Address Query: The query is asking for any other addresses, and 652 the address of the NVE from which they are reachable, that 653 correspond to the same interface, within the VN of the query. 654 Typically that would be either (1) a MAC address with the 655 querying NVE primarily interested in the NVE by which that MAC 656 address is reachable, or (2) an IP address with the querying 657 NVE interested in the corresponding MAC address and the NVE by 658 which that MAC address is reachable. But it could be some other 659 address type. 661 Query Frame: Where a QUERY Record is the result of an ARP, ND, 662 RARP, or unknown unicast MAC destination address, the ingress 663 NVE MAY send the frame to a Pull NVA if the frame is small 664 enough that the resulting Query message does not exceed the MTU. 666 If no response is received to a Pull Directory Query message 667 within a timeout configurable in milliseconds that defaults to 668 200, the Query message should be re-transmitted with the same 669 Sequence Number up to a configurable number of times that 670 defaults to three. If there are multiple QUERY Records in a 671 Query message, responses can be received to various subsets of 672 these QUERY Records before the timeout. In that case, the 673 remaining unanswered QUERY Records should be re-sent in a new 674 Query message with a new sequence number. If an NVE is not 675 capable of handling partial responses to queries with multiple 676 QUERY Records, it MUST NOT send a Query message with more 677 than one QUERY Record in it. 679 8.2.
Pull Response 681 There are several possibilities of the Pull Response: 683 1. Valid inner-outer address mapping, coupled with the valid 684 timer indicating how long the entry can be cached by the 685 NVE. 686 The timer for cache should be short in an environment where 687 VMs move frequently. The cache timer can also be configured. 689 Internet-Draft NVA mapping distribution 691 2. The target being queried is not available. The response 692 should include the policy if requester should forward data 693 frame in legacy way, or drop the data frame. 695 3. The requestor is administratively prohibited from getting an 696 informative response. 698 Pull NVA Response messages are sent as unicast to the 699 requesting NVE. Responses are sent with the same VN. The 700 specific data format is as follows: 702 0 1 2 3 703 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 704 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 705 | Ver | Type | Flags | Count | Err | SubErr | 706 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 707 | Sequence Number | 708 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 709 | RESPONSE 1 710 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-... 711 | RESPONSE 2 712 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-... 713 | ... 714 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-... 715 | RESPONSE K 716 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-... 717 Figure 5. Pull Response TLV 719 Type: 2 = Response. 721 Flags: MUST be sent as zero and ignored on receipt. 723 Count: Count is the number of RESPONSE Records present in the 724 Response message. 726 Sequence Number: There are many Pull Queries from NVEs; each 727 Pull Query has a different sequence number. The Sequence 728 Number in the Pull Response reflects the sequence number for 729 the query. 731 Err, SubErr: A two part error code. Zero unless there was an 732 error in the Query message, for which case see Section 3.5. 
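Because the Sequence Number in a Pull Response echoes that of the Query, an NVE can keep a table of outstanding Queries keyed by Sequence Number and apply the retransmission rules of Section 8.1 to it. The sketch below is one possible arrangement, not a required implementation. The class and method names are hypothetical, time is passed in explicitly as milliseconds, and the one-based numbering of QUERY Records is assumed from the RESPONSE Index description.

```python
import itertools


class PendingQueries:
    """Hypothetical table of outstanding Pull Queries keyed by Sequence Number."""

    def __init__(self, timeout_ms=200, max_retransmits=3):
        # Defaults follow the text: 200 ms timeout, three retransmissions.
        self.timeout_ms = timeout_ms
        self.max_retransmits = max_retransmits
        self._seq = itertools.count(1)
        self._pending = {}

    def send(self, records, now_ms):
        """Register a new Query and return its Sequence Number."""
        seq = next(self._seq) & 0xFFFFFFFF
        self._pending[seq] = {
            "records": list(records),
            "answered": set(),
            "deadline": now_ms + self.timeout_ms,
            "retransmits": 0,
        }
        return seq

    def on_response(self, seq, indexes):
        """Record answered QUERY Record indexes; False if seq is unknown."""
        entry = self._pending.get(seq)
        if entry is None:
            return False
        count = len(entry["records"])
        # Indexes larger than the Query's Count are ignored.
        entry["answered"].update(i for i in indexes if 1 <= i <= count)
        if len(entry["answered"]) >= count:
            del self._pending[seq]
        return True

    def on_tick(self, now_ms):
        """Return the actions due at now_ms as (kind, seq, records) tuples."""
        actions = []
        for seq in list(self._pending):
            entry = self._pending[seq]
            if now_ms < entry["deadline"]:
                continue
            if entry["answered"]:
                # Partial answer: re-send the rest with a new Sequence Number.
                remaining = [r for i, r in enumerate(entry["records"], start=1)
                             if i not in entry["answered"]]
                del self._pending[seq]
                actions.append(("new", self.send(remaining, now_ms), remaining))
            elif entry["retransmits"] < self.max_retransmits:
                # No answer at all: same Sequence Number again.
                entry["retransmits"] += 1
                entry["deadline"] = now_ms + self.timeout_ms
                actions.append(("retransmit", seq, entry["records"]))
            else:
                del self._pending[seq]
                actions.append(("give_up", seq, entry["records"]))
        return actions
```

A caller would invoke on_tick from its timer loop and transmit whatever the returned actions describe. What an NVE does after giving up (drop, flood, or try another Pull NVA) is a local policy choice that this sketch leaves open.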
736  RESPONSE: Each RESPONSE Record within a Pull NVA Response
737  message is formatted as follows:

739    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
740  +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
741  | SIZE |OV| RESV | Index |
742  +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
743  |                   Lifetime                    |
744  +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
745  | Response Data ...
746  +--+--+--+--+--+--+--+--+--+--+--...

748  SIZE: Size of the RESPONSE Record in bytes starting after the
749  SIZE field and following byte. Thus the minimum value of SIZE
750  is 2. If SIZE is less than 2, that RESPONSE Record and all
751  subsequent RESPONSE Records in the Response message MUST be
752  ignored and the entire Response message MAY be ignored.

754  OV: The overflow flag. Indicates, as described below, that
755  there was too much Response Data to include in one Response
756  message.

758  RESV: Four reserved bits that MUST be sent as zero and ignored
759  on receipt.

761  Index: The relative index of the QUERY Record in the Query
762  message to which this RESPONSE Record corresponds. The index
763  will always be one for Query messages containing a single
764  QUERY Record. If the Index is larger than the Count that was
765  in the corresponding Query, that RESPONSE Record MUST be
766  ignored and subsequent RESPONSE Records or the entire Response
767  message MAY be ignored.

769  Lifetime: The length of time for which the response should be
770  considered valid in units of 200 milliseconds except that the
771  values zero and 2**16-1 are special. If zero, the response can
772  only be used for the particular query from which it resulted
773  and MUST NOT be cached. If 2**16-1, the response MAY be kept
774  indefinitely but not after the Pull NVA goes down or becomes
775  unreachable. The maximum definite time that can be expressed
776  is a little over 3.6 hours.

778  Response Data: There are various types of RESPONSE Records.
780  - If the Err field is non-zero, then the Response Data is a
781  copy of the corresponding QUERY Record data, that is, either
782  an AFN followed by an address or a query frame.

786  - If the Err field is zero and the corresponding QUERY Record
787  was an address query, then the Response Data is the contents
788  of an Interface Addresses APPsub-TLV [IA]. The maximum size of
789  such contents is 253 bytes in the case when SIZE is 255.

791  - If the Err field is zero and the corresponding QUERY Record
792  was a frame query, then the Response Data consists of the
793  response frame for an ARP, ND, or RARP query, or a copy of the
794  frame for an unknown unicast destination MAC.

796  Multiple RESPONSE Records can appear in a Response message
797  with the same index if the answer to a QUERY Record consists
798  of multiple Interface Addresses APPsub-TLV contents. This would
799  be necessary if, for example, a MAC address within a VN
800  appears to be reachable by multiple NVEs. However, all
801  RESPONSE Records to any particular QUERY Record MUST occur in
802  the same Response message. If a Pull NVA holds more mappings
803  for a queried address than will fit into one Response message,
804  it selects which to include by some method outside the scope
805  of this document and sets the overflow flag (OV) in all of the
806  RESPONSE Records responding to that query address.

808  If no response is received to a Pull request within a
809  configurable timeout, the request should be re-transmitted
810  with the same Sequence Number up to a configurable number of
811  times that defaults to three.

813  8.3. Cache Consistency

815  It is important that the cached information be kept consistent
816  with the actual placement of VMs. Therefore, it is highly
817  desirable to have a mechanism to prevent NVEs from using
818  stale mapping entries.
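As a rough illustration of how an NVE might honor the Lifetime field of a RESPONSE Record when caching mappings (units of 200 milliseconds, with zero meaning "do not cache" and 2**16-1 meaning "keep indefinitely until the Pull NVA goes down or becomes unreachable"), consider the Python sketch below. The class MappingCache, its method names, the string keys, and the injectable clock are all hypothetical; this is one possible arrangement under those assumptions, not a mechanism defined by this document.

```python
import time
from typing import Callable, Dict, Optional, Tuple

LIFETIME_UNIT_SECONDS = 0.2       # Lifetime is counted in units of 200 ms
LIFETIME_NO_CACHE = 0             # use for this query only, MUST NOT be cached
LIFETIME_INDEFINITE = 2**16 - 1   # MAY be kept until the Pull NVA is lost


def lifetime_to_seconds(lifetime: int) -> Optional[float]:
    """Convert a Lifetime value to seconds; None means 'indefinite'."""
    if not 0 <= lifetime <= LIFETIME_INDEFINITE:
        raise ValueError("Lifetime is a 16-bit unsigned value")
    if lifetime == LIFETIME_INDEFINITE:
        return None
    return lifetime * LIFETIME_UNIT_SECONDS


class MappingCache:
    """Hypothetical NVE-side cache of inner-to-outer address mappings."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (mapping, expiry time, or None for indefinite entries)
        self._entries: Dict[str, Tuple[str, Optional[float]]] = {}

    def store(self, key: str, mapping: str, lifetime: int) -> bool:
        """Cache a mapping; returns False when Lifetime forbids caching."""
        if lifetime == LIFETIME_NO_CACHE:
            return False
        seconds = lifetime_to_seconds(lifetime)
        expiry = None if seconds is None else self._clock() + seconds
        self._entries[key] = (mapping, expiry)
        return True

    def lookup(self, key: str) -> Optional[str]:
        """Return the cached mapping, dropping it if its Lifetime has run out."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        mapping, expiry = entry
        if expiry is not None and self._clock() >= expiry:
            del self._entries[key]
            return None
        return mapping

    def flush(self) -> None:
        """Discard everything, e.g. when the Pull NVA becomes unreachable."""
        self._entries.clear()
```

Using a monotonic clock keeps expiry independent of wall-clock adjustments; a caller would invoke flush() on loss of reachability to the Pull NVA so that indefinite entries do not outlive it.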
820  When there is any change in a Pull NVA, such as an entry being
821  deleted or a new entry being added, and there may be unexpired
822  stale information at some NVEs, the Pull NVA MUST send an
823  unsolicited Update message to the relevant NVEs.

825  To achieve this goal, a Pull NVA server MUST maintain one of
826  the following, in order of increasing specificity.

828  1. An overall record per VN of when the last returned query
829  data will expire at a requester and when the last query record
830  specific negative response will expire.

834  2. For each unit of data (IA APPsub-TLV Address Set) held by
835  the NVA and each address about which a negative response was
836  sent, when the last expected response with that unit or
837  negative response will expire at a requester.

839  Note: Caching negative replies is especially important,
840  because there are many invalid address queries. A study has
841  shown that for each valid ND query, there are hundreds of
842  invalid address queries.

844  3. For each unit of data held by the NVA and each address
845  about which a negative response was sent, a list of NVEs that
846  were sent that unit as the response or sent a negative
847  response to the address, with the expected time to expiration
848  at each of them.

850  8.4. Update Message Format

852  An Update message is formatted as a Response message except
853  that the Type field in the message header is a different
854  value.

856  Update messages are initiated by a Pull NVA. The Sequence
857  Number space used is controlled by the originating Pull NVA
858  and is different from the Sequence Number space used in a Query
859  and its corresponding Response, which is controlled by the
860  querying NVE.

862  The Flags field of the message header for an Update message is
863  as follows:

865     +---+---+---+---+
866     | F | P | N | R |
867     +---+---+---+---+

869  F: The Flood bit. If zero, the Update is to be unicast. If
870  F=1, it is multicast to relevant NVEs.
872  P, N: Flags used to indicate positive or negative Update
873  messages. P=1 indicates positive. N=1 indicates negative. Both
874  may be 1 for a flooded all-addresses Update.

876  R: Reserved. MUST be sent as zero and ignored on receipt.

880  8.5. Acknowledge Message Format

882  An Acknowledge message is sent in response to an Update to
883  confirm receipt or indicate an error unless the response is
884  inhibited by rate limiting. It is also formatted as a Response
885  message.

887  If there are no errors in the processing of an Update message,
888  the message is essentially echoed back with the Type changed
889  to Acknowledge.

891  If there was an overall or header error in an Update message,
892  it is echoed back as an Acknowledge message with the Err and
893  SubErr fields set appropriately.

895  If there is a RESPONSE Record level error in an Update
896  message, one or more Acknowledge messages may be returned.

898  8.6. Pull Request Errors

900  If errors occur at the query level, they MUST be reported in a
901  response message separate from the results of any successful
902  queries. If multiple queries in a request have different
903  errors, they MUST be reported in separate response messages.
904  If multiple queries in a request have the same error, this
905  error response MAY be reported in one response message.

907  8.7. Redundant Pull NVAs

909  There could be multiple NVAs holding mapping information for a
910  particular VN for reliability or scalability purposes. Pull
911  NVAs advertise themselves by having the Pull Directory flag on
912  in their Interested VNs sub-TLV [rfc6326bis].

914  A pull request can be sent to any of them that is reachable
915  but it is RECOMMENDED that pull requests be sent to an NVA that
916  is least cost from the requesting NVE.

918  9.
Hybrid Mode

920  For some edge nodes that have a great number of VNs enabled, and
921  where the combined number of TSs under all those VNs is large,
922  managing the inner-outer address mapping for TSs under all those
923  VNs can be a challenge. This is especially true for Data Center
927  gateway nodes, which need to communicate with a majority of
928  VNs if not all.

930  For those NVE nodes, a hybrid mode should be considered. That
931  is, Push Mode is used for some VNs and Pull Mode is
932  used for other VNs. It is the network operator's
933  decision by configuration as to which VNs' mapping entries are
934  pushed down from the NVA and which VNs' mapping entries are
935  pulled.

937  In addition, the NVA can inform the NVE that it should forward in
938  the legacy way if the NVA doesn't have the mapping information, or
939  that the NVE is administratively prohibited from forwarding data
940  frames to the requested target.

942  10. Redundancy

944  For redundancy purposes, there should be multiple NVAs that hold
945  mapping information for each VN. At any given time, only one or a
946  small number of Push NVAs are considered active for a
947  particular VN. Each NVA should announce its capability and
948  priority to all the edges.

950  11. Inconsistency Processing

952  If an NVE notices that a Push NVA is no longer reachable, it MUST
953  ignore any mapping entries from that NVA because it is no longer
954  being updated and may be stale.

956  There may be transient conflicts between mapping information from
957  different Push NVAs or conflicts between locally learned
958  information and information received from a Push NVA. An NVA may
959  associate a confidence level with address table information so
960  that, in case of such conflicts, information with a higher confidence
961  value is preferred over information with a lower confidence.
962  In case of equal confidence, Push NVA information is preferred to
963  locally learned information and if information from Push NVAs
964  conflicts, the information from the higher priority Push NVA is
965  preferred.

969  12. Protocols to consider to carry NAMD messages

971  NAMD messages can be carried by IGP, BGP, or even OVSDB. The NVO3
972  WG focuses only on specifying the NAMD message structure. How NAMD
973  TLVs are integrated with BGP or IGP messages will be discussed in
974  the corresponding WGs, e.g. BESS WG.

976  OVSDB (Open vSwitch Database Management Protocol, RFC 7047, an
977  individual submission) is used to bootstrap a vSwitch with the
978  needed configuration (e.g. number of flow tables, the pipeline among
979  those flow tables, path/link cost, Timer for Spanning Tree, Hello
980  Timer, enabling Multicast snooping, etc). After OVSDB bootstraps a
981  vSwitch, OpenFlow is used to dynamically pass down the flow
982  entries.

984  In principle, some components of OVSDB could be
985  adopted (with updates) to provide the control plane between NVA
986  and NVE. For example, changes to OVSDB are needed to address:

988  - How do edge nodes request Push?

990  - How do edge nodes express the VNs in which they participate?

992  - How does an NVA express the supported VN ranges or lists?

994  - How do edge nodes feed back newly discovered attached TSs to
995  the NVA?

997  - How do edge nodes exchange mappings among themselves?

999  13. Security Considerations

1001  Incorrect information in an NVA can result in a variety of security
1002  threats including the following:

1004  Incorrect directory mappings can result in data being delivered
1005  to the wrong hosts/VMs, or set of hosts in the case of
1006  multi-destination packets, violating security policy.
1008  Missing or incorrect data in an NVA can result in denial of service
1009  due to sending data packets to black holes or discarding data on
1010  ingress due to incorrect information that their destinations are
1011  not reachable.

1015  Push NVA data messages can be authenticated by including an
1016  Authentication TLV. See [RFC5304] and [RFC5310].

1018  14. IANA Considerations

1020  This section gives IANA allocation and registry considerations.

1022  15. Acknowledgements

1024  Special thanks to David Black, Dino Farinacci, Mingui Zhang, and
1025  XiaoHu Xu for valuable suggestions and comments on this draft.

1027  16. References

1029  16.1. Normative References

1031  [RFC4971] J. Vasseur et al, "Intermediate System to Intermediate
1032            System (IS-IS) Extensions for Advertising Router
1033            Information", RFC 4971, July 2007.

1035  [nvo3-nve-nva-cp-req] Kreeger, et al., "Network
1036            Virtualization NVE to NVA Control Protocol Requirements",
1037            draft-ietf-nvo3-nve-nva-cp-req-00, July 31, 2013.

1039  [IA]      Eastlake, D., L. Yizhou, R. Perlman, "TRILL: Interface
1040            Addresses APPsub-TLV", draft-ietf-trill-ia-appsubtlv,
1041            work in progress.

1043  16.2. Informative References

1045  [802.1Q]  IEEE Std 802.1Q-2011, "IEEE Standard for Local and
1046            metropolitan area networks - Virtual Bridged Local Area
1047            Networks", May 2011.

1049  [802.1Qbg] IEEE Std 802.1Qbg-2012, "Media Access Control (MAC)
1050            Bridges and Virtual Bridged Local Area Networks-Edge
1051            Virtual Bridging", July 2012.

1053  [RFC826]  Plummer, D., "An Ethernet Address Resolution Protocol",
1054            RFC 826, November 1982.

1056  [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
1057            "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
1058            September 2007.
1062  Authors' Addresses

1064  Linda Dunbar
1065  Huawei Technologies
1066  5430 Legacy Drive, Suite #175
1067  Plano, TX 75024, USA
1068  Phone: (469) 277 5840
1069  Email: linda.dunbar@huawei.com

1071  Donald Eastlake
1072  Huawei Technologies
1073  155 Beaver Street
1074  Milford, MA 01757 USA
1075  Phone: 1-508-333-2270
1076  Email: d3e3e3@gmail.com

1078  Tom Herbert
1079  Google
1080  Email: therbert@google.com