idnits 2.17.1 draft-rp-trill-parent-selection-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == The page length should not exceed 58 lines per page, but there was 1 longer page, the longest (page 1) being 911 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (February 23, 2017) is 2619 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- -- Missing reference section? 'RFC2119' on line 842 looks like a reference -- Missing reference section? 'RFC6325' on line 847 looks like a reference -- Missing reference section? 'RFC7780' on line 852 looks like a reference -- Missing reference section? 'RFC7783' on line 858 looks like a reference -- Missing reference section? 'RFC4971' on line 863 looks like a reference -- Missing reference section? 'RFC7176' on line 837 looks like a reference Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Trill WG R. Parameswaran, 2 INTERNET-DRAFT Brocade Communications, Inc. 3 Intended Status:Experimental February 23, 2017 4 Expires: August 27, 2017 6 TRILL: Parent node Shifts in Tree Construction, Mitigation. 7 8 Abstract 10 This draft documents a known problem in the Trill tree construction 11 mechanism and offers two different approaches to solve the problem. 13 Status of This Memo 15 This Internet-Draft is submitted to IETF in full conformance with the 16 provisions of BCP 78 and BCP 79. 18 Distribution of this document is unlimited. Comments should be sent 19 to the authors or the TRILL working group mailing list: 20 trill@ietf.org. 22 Internet-Drafts are working documents of the Internet Engineering 23 Task Force (IETF), its areas, and its working groups. Note that 24 other groups may also distribute working documents as Internet- 25 Drafts. 27 Internet-Drafts are draft documents valid for a maximum of six months 28 and may be updated, replaced, or obsoleted by other documents at any 29 time. It is inappropriate to use Internet-Drafts as reference 30 material or to cite them other than as "work in progress." 32 The list of current Internet-Drafts can be accessed at 33 http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft 34 Shadow Directories can be accessed at 35 http://www.ietf.org/shadow.html. 37 Terminology and Acronyms. 39 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 40 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 41 document are to be interpreted as described in [RFC2119]. 43 Table of Contents 45 1. Introduction...............................................1 46 2. Tree construction in Trill.................................2 47 3. Issues with the Trill tree construction algorithm..........2 48 4. Alternative A - Using the Affinity sub-TLV.................4 49 5. Alternative B - Variant of the Djikstra SPF algorithm......7 50 5.1 Choice of the policy function........................11 51 6. Network wide selection of computation algorithm...........13 52 7. Relationship to draft-ietf-trill-resilient-trees..........14 53 8. Security Considerations...................................16 54 9. IANA Considerations.......................................16 55 10. Informative References....................................16 57 1. Introduction. 59 Trill is a data center technology that uses link-state routing 60 mechanisms in a layer 2 setting, and serves as a replacement 61 for spanning-tree. Trill uses trees rooted at pre-determined nodes 62 as a way to distribute multi-destination traffic. Multi-destination 63 traffic includes traffic such as layer-2 broadcast frames, unknown 64 unicast flood frames, and layer 2 traffic with multicast MAC 65 addresses (collectively referred to as BUM traffic). Multi-destination 66 traffic is typically hashed onto one of the available trees and sent 67 over the tree, potentially reaching all nodes in the network (hosts 68 behind which may own/need the packet in question). 70 2. Tree construction in Trill. 72 Tree construction in Trill is defined by [RFC6325], with additional 73 corrections defined in [RFC7780]. 75 The tree construction mechanism used in Trill codifies 76 certain tree construction steps which make the resultant trees 77 very brittle. Specifically, the parent selection mechanism in Trill 78 causes problems in case of node failures. Trill uses the following rule 79 - when constructing an SPF tree, if there are multiple possible 80 parents for a given node (i.e. if multiple upstream nodes can 81 potentially pull in a given node during SPF, all at the same 82 cumulative cost, then the parent selection is imposed in the 83 following manner): 85 [RFC6325]: 86 "When building the tree number j, remember all possible 87 equal cost parents for node N. After calculating the entire 'tree' 88 (actually, directed graph), for each node N, if N has 'p' parents, 89 then order the parents in ascending order according to the 90 7-octet IS-IS ID considered as an unsigned integer, and number them 91 starting at zero. For tree j, choose N's parent as choice j mod p." 93 There is an additional correction posted to this in [RFC7780]: 95 [RFC7780], Section 3.4: 97 "Section 4.5.1 of [RFC6325] specifies that, when building 98 distribution tree number j, node (RBridge) N that has multiple 99 possible parents in the tree is attached to possible parent 100 number j mod p. Trees are numbered starting with 1, but possible 101 parents are numbered starting with 0. As a result, if there are 102 two trees and two possible parents, then in tree 1 parent 1 will 103 be selected, and in tree 2 parent 0 will be selected. 105 This is changed so that the selected parent MUST be (j-1) mod p. As 106 a result, in the case above, tree 1 will select parent 0, and tree 2 107 will select parent 1. This change is not backward compatible with 108 [RFC6325]. If all RBridges in a campus do not determine distribution 109 trees in the same way, then for most topologies, the RPFC will drop 110 many multi-destination packets before they have been properly 111 delivered." 113 3. Issues with the Trill tree construction algorithm. 115 With this tree construction mechanism in mind,let's look at 116 the Spine-Leaf topology presented below and consider the 117 calculation of Tree number 2 in Trill. Assume all the links in the tree 118 are at the same cost. 120 A-- --B 121 / \ \/ /\ 122 / \/\ _/_ \ 123 /__ _/\ / \\ 124 // \/ \\ 125 1 2 3 126 \ | / 127 \ | / 128 \ | / 129 \ | / 130 \ | / 131 \ | / 132 \ |/ 133 C 135 Assume that in the above topology, when ordered by 7-octet ISIS-id, 136 1 < 2 < 3 holds and that the root for Tree number 2 is A. Given the 137 ordered set {1, 2, 3} , these nodes have the following indices (with a 138 starting index of 0): 140 Node Index 141 1 0 142 2 1 143 3 2 145 Given the SPF constraint and that the tree root is A, the parent for 146 nodes 1,2, and 3 will be A. However, when the SPF algorithm tries to 147 pull B or C into the tree, we have a choice of parents, namely 1, 2, 148 or 3. 150 Given that this is tree 2, the parent will be the one with index 151 (2-1) mod 3 (which is equal to 1). Hence the parent for node B will be 152 node 2. 153 A 154 /|\ 155 / | \ 156 / | \ 157 1 2 3 158 /\ 159 / \ 160 B C 162 However, due to Trill's parent selection algorithm, the sub-tree 163 rooted at Node 2 will be impacted even if Node 1 or Node 3 164 go down. 166 Take the case where Node 1 goes down. Tree 2 must now be 167 re-computed (this is normal) - but now, when the SPF computation is 168 underway, when the SPF process tries to pull in B, the list of 169 potential parents for B now are {2 and 3}. So, after ordering these 170 by ISIS-Id as {2, 3} (where 2 is considered to be at index of 0 and 3 171 is considered to be at index 1), for tree 1, we apply Trill's formula 172 of: 174 Parent's index = (TreeNumber-1) mod Number_of_parents. 175 = (2-1) mod 2 176 = 1 mod 2 177 = 1 (which is the index of Node 3) 179 The re-calculated tree now looks as shown below. The shift in 180 parent nodes (for B) may cause disruption to live traffic in the 181 network, and is unnecessary in absolute terms because the existing 182 parent for node B, node 2, was not perturbed in any way. 184 A 185 / \ 186 / \ 187 / \ 188 2 3 189 /\ 190 / \ 191 B C 193 Aside from the disruption posed by the change in the tree links, 194 depending upon how the concerned rbridges stripe vlans/FGLs across 195 trees and how they may prune these, additional disruption is possible 196 if the forwarding state on the new parent rbridge is not primed to 197 match the new tree structure. This churn could simply be avoided 198 with a better algorithm/approach. 200 The parent shift issue noted above can be solved in at least two 201 different ways: 203 a) by means of the affinity sub-TLV. 205 b) by means of a network wide policy based parent selection 206 mechanism. 208 While the techniques identified in this draft have an immediate 209 benefit when applied to spine/leaf networks popular in data-center 210 designs, nothing in either of the approaches outlined below assumes a 211 spine-leaf network. The techniques outlined below will work on any 212 connected graph. Furthermore, no symmetry in link-cost is assumed. 214 4. Alternative A - Using the Affinity sub-TLV. 216 At a high level, this problem can be solved by having the affected 217 parent send out an affinity sub-TLV identifying the children for 218 which it wants to preserve the parent-child relationship, subject to 219 network events which may change the structure of the tree. The 220 preferred parent node would send out an affinity sub-TLV with 221 multiple affinity records, one per child node, listing the 222 concerned tree number. 224 It would be sufficient to have a local configuration option (e.g. 225 a CLI) at one of the nodes which is deemed to be the preferred parent. 226 In such case, the following steps may provide a way to implement this 227 proposal: 229 a. The operator locally configures the preferred parent to indicate 230 its stickiness in tree construction for a specific tree number 231 and tree root via the affinity sub-TLV. This can be done before 232 tree construction if the operator consults the 7 octet ISIS-ID 233 relative ordering of the concerned nodes and infers upfront which 234 of the potential parent nodes would become the parent node for a 235 given set of children on that tree number under the Trill tree 236 calculation mechanism. The operator assumes the responsibility 237 of configuring the preferred parent stickiness on only one node 238 amongst a set of sibling nodes relative to the tree root for 239 that tree number. 241 b. The very first tree calculation uses the default Trill specified 242 parent selection rules. The configured node advertises its parent 243 preference via the affinity sub-TLV when it completes a normal 244 Trill specified tree calculation, and finds itself the parent of 245 multiple child nodes per the calculation. The affinity sub-TLV 246 must reflect the appropriate tree number and the child nodes for 247 which the concerned node is a parent node. The affinity sub-TLV 248 should be published when the tree is deemed to have converged 249 and it may be necessary to start a timer when triggering the 250 tree calculation, to track the convergence. Alternately, the 251 publication of the affinity sub-TLV may be triggered by the 252 download of the routes to the (L2) RIB, which typically happens 253 after successful computation of the entire tree (however, see 254 vii. below). 256 c. When any change event happens in the network, one which forces a 257 tree re-calculation for the concerned tree, the configured node 258 should run through the normal Trill tree calculation agnostic of 259 the fact that it has published an affinity sub-TLV i.e the 260 node's own affinity sub-TLV should not be directly used in 261 establishing parent relationships. During the tree 262 calculation, if the node turns out to be on the list of potential 263 parents for some or all of the child nodes for which it 264 published the affinity sub-TLV (disregarding its possible 265 exclusion as parent due to 7 octet ISIS ID ordering logic), or 266 for any other child nodes for which it was not previously a 267 parent node, then it should react in the following manner: 269 i. If the node is still a potential parent for some of the 270 children identified in the existing affinity sub-TLV, it should 271 start a timer upon whose expiration, the node would intend to 272 send out an updated affinity sub-TLV identifying the correct 273 sub-set of children for which the node aspires to continue the 274 parent relationship. This case would also apply if there are 275 new child nodes for which the node is now a parent (however, 276 see the conflicted affinity sub-TLV rules below, and in vii 277 further below). 279 For its own tree computation, the node should use itself as 280 parent in order to pull the set of children identified during 281 the SPF run into the tree unless another affinity sub-TLV is 282 seen for the same tree number with an overlap in the set of 283 child nodes such that the other affinity sub-TLV would win 284 according to the conflict resolution rules in section 5.3 285 of [RFC7783]. In such case, this node should not publish 286 an affinity sub-TLV (and retract it if it is already 287 published), and honor the other node's affinity sub-TLV 288 instead. 290 ii. if the tree structure changes such that it is no longer a 291 potential parent for any of the child nodes in the advertised 292 affinity sub-TLV, then it must start a timer upon whose 293 expiration, it would intend to retract the affinity sub-TLV. 294 In this case, the default Trill tie-break rule would need to 295 be used during SPF construction for the vertices that were 296 children of this node previously. 298 iii. If there are any additional tree computations while the timer is 299 running, the timer should be re-started/extended in order to 300 absorb the interim network events. It is possible that the 301 intended action at the expiration of the timer may change 302 meanwhile. The timer needs to be large enough to absorb 303 multiple network events that may happen due to a change in the 304 physical state of the network, and yet short enough to 305 avoid delaying the update of the affinity sub-TLV. 307 iv. At the expiration of the timer, the existing state of the tree 308 should be compared with the existing affinity sub-TLV and the 309 intended change in the status of the affinity sub-TLV is 310 carried out e.g. an update to the list of children, or a 311 retraction. 313 v. Alternately, the above steps (re-examination of the affinity 314 sub-TLV and update) may be tied to/triggered from the download 315 of the tree routes to the RIB, since that typically happens 316 upon a successful computation of the complete tree. An 317 additional stabilization timer could be used to counteract 318 back-to-back computations of the tree due to a burst of 319 network events. 321 vi. Note that this approach may cause an additional tree computation 322 at remote nodes once the updated affinity sub-TLV (or lack of 323 it) is received/perceived, beyond the network events which led 324 up to the change in the tree. 326 vii. The preferred parent-node should not originate an affinity 327 sub-TLV if it sees an affinity sub-TLV from some other node 328 for the same tree number and for some or all of the same 329 child-nodes, but only if the other node's affinity sub-TLV would 330 win using the conflict tie-break rules in section 5.3 of 331 [RFC7783]. Any existing affinity sub-TLV already published by 332 this node in such a situation should be retracted. 334 d. Remote nodes should compute their trees honoring the latest 335 affinity sub-TLV or the lack thereof, sent from this node. 337 e. Situations where the node advertising the Affinity sub-TLV dies 338 or restarts should be handled using the normal handling for such 339 scenarios relating to the parent Router Capability TLV, and as 340 specified in [RFC4971]. 342 f. Conflicts in the Affinity sub-TLV from different originators for 343 the same tree number and child node should be handled as 344 specified in section 5.3 of [RFC7783]. 346 g. Situations where a link constantly flaps, should be handled 347 by having the preferred parent node retract the affinity 348 sub-TLV, if it affects the parent child relationships in 349 consideration. The long-term state of the affinity sub-TLV can 350 be monitored by the preferred parent node to see if it is being 351 published and retracted repeatedly in multiple iterations or 352 if a specific set of children are being constantly added and 353 removed. The preferred parent may resume publication of the 354 affinity sub-TLV once it perceives the network to be stable 355 again in the future. 357 If the node is forced to retract its affinity sub-TLV due to a change, 358 in the tree structure, it can then repeat these steps in a subsequent 359 tree construction, if the same node becomes a parent again, so long 360 as it perceives the network to be stable (free of link/node flaps). 361 In terms of nodes that do not support this algorithm, they are 362 expected to seamlessly inter-operate with this scheme, so long as 363 they understand and honor the affinity sub-TLV. The draft assumes 364 that most implementations now support the affinity sub-TLV. 366 5. Alternative B - Using a variant of the Djikstra SPF algorithm. 368 While alternative A may be preferable, there may be another way in 369 which the same goal can be achieved - the classic SPF algorithm may be 370 modified to address the above problem. However, this approach should 371 be regarded as experimental, and is not advocated for use in a 372 production environment until further evaluation is possible. The 373 approach requires all nodes in the network to use the same modified 374 SPF algorithm and same tie-break policy, and may work best in small 375 sized networks. 377 When pulling vertices from TENT to PATH, if a given vertex can 378 can be pulled in by more than one parent at the same optimal cost, 379 the modified SPF algorithm presented below allows a policy filter to 380 be placed during the parent selection process, which can identify a 381 preferred parent for the vertex being pulled in. 383 Initially, tree computation starts off with the Trill parent 384 selection rules. For subsequent tree computations, the policy function 385 references stable snap-shots of the prior tree computation and tries to 386 match existing parent-child relationships on a best-effort basis. 388 The algorithm models the SPF algorithm in traditional terms, 389 referencing two sets of vertices, PATH, and TENT. PATH contains 390 vertices known to be reachable at optimal cost from the 391 root vertex v. TENT contains vertices being examined for optimal 392 reachability. TENT contains satellite information for each vertex, 393 which is the immediate parent that caused the vertex to be moved into 394 TENT, and the cost of the link between them. The algorithm is 395 presented in a pseudocode that is similar in syntax to the 'C' 396 language. 398 399 policy_parent_selection_optimized_spf_run(vertex root, 400 PolicyFunction PF, 401 Tree PreviousTree) 402 /* representation of the 403 * same tree as computed 404 * in the previous iteration, 405 * as a reference for the 406 * policy tie-breaker 407 * function below. 408 */ 409 { 411 /* L, LL are lists of tuples of the following form */ 412 List<(vertex, parent, link(parent, vertex), 413 link-cost(parent, vertex))> L, LL; 415 /* Heap is ordered in ascending order of link-costs */ 416 HEAP tmpHEAP; 418 /* 419 * map_vertex2parents_heap maps a vertex to a HEAP of tuples, each 420 * identifying a parent relationship. The heap is ordered in 421 * ascending order of link-cost. It also serves as a representation 422 * of the set TENT. 423 */ 425 Mapping map_vertex2parents_heap{vertex -> HEAP of tuples of type 426 (vertex, parent, 427 link(parent,vertex) 428 link-cost(parent, vertex)) }; 429 initialize_to_empty(L); 431 initialize_to_empty(LL); 433 initialize_to_empty(map_vertex2parents_heap); 435 set distance_from_root(root) = 0; 436 root.in_PATH = TRUE; 437 for each vertex in graph { 438 if vertex is not root then { 439 set distance_from_root(vertex) = infinity; 440 vertex.in_PATH = FALSE; 441 } 442 } 444 for each link (root,w) that is attached to root { 445 if (distance_from_root(w) > link-cost(root, w)) { 446 map_vertex2parents_heap{w}.add_to_heap( 447 (w, root, link(root,w), link-cost(root,w))); 448 } 449 } 451 /* 452 * Iterate while there are vertices (tuples) in TENT - TENT is 453 * realized via map_vertex2parents_heap which is a mapping table 454 * with each entry of the table representing a (per-vertex) HEAP 455 * of tuples. Each tuple represents a link i.e. a parent-child 456 * relationship along with the identity of the link and its 457 * associated cost. 458 */ 459 while (not_empty(map_vertex2parents_heap)) { 461 /* There may be more than one vertex in TENT at 462 * minimal cost, and this modified SPF algorithm considers all 463 * of them simultaneously, processing them as a list, in terms 464 * of their addition to PATH, and in terms of equally 465 * considering them parents of as-yet undiscovered vertices 466 * that may be reachable at the same total cost from members 467 * of the list. 468 * priority_heap_minimum_multiple_dequeue_as_list takes the 469 * heap and returns a list of all the vertices that are at 470 * minimum cost from v. The list is returned as a list of 471 * tuples, where each tuple is of the form: 472 * 473 * (vertex, parent, link(parent, vertex), 474 * link-cost(vertex, parent)) 475 */ 477 initialize_to_empty(tmpHEAP); 479 /* 480 * Iterate over the vertices comprising the key-space of 481 * map_vertex2parents_heap. The minimum from TENT is 482 * derived in two steps - find the minimum cost for each vertex 483 * from its potential parents (find the tuples), and then find 484 * the minimum of the above minimums using a temporary HEAP. 485 */ 486 foreach (vertex v in valid-keys(map_vertex2parents_heap)) { 487 /* tuples of the type 488 * (vertex, parent,link(vertex, parent), 489 * link-cost(vertex, parent)) at a minimum cost per hash 490 * table vertex entry are inserted to tmpHEAP. Each entry 491 * in tmpHEAP is a tuple. 492 */ 493 if (is-empty(map_vertex2parents_heap{v}) 494 continue; /* Move on to the next vertex */ 496 /* 497 * heap_dequeue_multiple_minimum_as_list() is assumed to 498 * dequeue multiple entries at the top of the heap, if 499 * they are all at the same minimum value of the heap 500 * metric (cumulative reachability cost for this 501 * application). 502 */ 503 tmpHEAP.add_to_heap(heap_dequeue_multiple_minimum_as_list( 504 map_vertex2parents_heap{v})); 505 } 507 L = heap_dequeue_multiple_minimum_as_list(tmpHEAP); 509 /* 510 * L is a list of tuples. 511 * L could potentially have only a single tuple. Iterate over 512 * tuples in L. Tuples in L are optimally reachable from nodes 513 * already in PATH. 514 */ 515 foreach tuple (y, y_parent, link(y_parent, y), 516 link-cost(y_parent, y)) in L { 518 /* 519 * Identify tuples relating to y's potential parents, 520 * all leading to optimal reachability for y, collect 521 * them in LL. 522 */ 523 LL = 524 heap_dequeue_multiple_minimum_as_list( 525 map_vertex2parents_heap{y}); 527 /* 528 * Policy function below takes a vertex, and its list of 529 * eligible parents, and chooses one parent for the vertex. 530 * It can be simply thought of as a tie-breaker that 531 * chooses one parent out of the list of eligible parents 532 * of y, but based on the consideration specified by the 533 * policy, and being able to do so in a way that 534 * minimizes tree churn. 535 * 536 * ASSUMPTION: Policy Function needs to be able 537 * to deal with trivial case of the list of eligible 538 * parents having only one parent, it must select that one 539 * parent unconditionally in that case for the given child 540 * node. 541 * 542 * PolicyFunction is not explicitly defined here, it may 543 * take other parameters such as a representation of the 544 * previous computation of the tree to help it make a 545 * selection that helps preserve links in the tree as they 546 * existed prior to this computation. 547 */ 549 if (y_parent == 550 PolicyFunction(y, LL, PreviousTree, TreeNumber)) { 552 /* 553 * if y is allowed by the policy function, it can get 554 * pulled into PATH here. However, if y is excluded 555 * at this stage, then it must have another potential 556 * instance in TENT (but see assumption above), which 557 * will pull it in thru a different parent - 558 * correctness of the SPF algorithm guarantees this, 559 * and this other instance with other parent must be 560 * in the list L. 561 */ 563 y.in_PATH = TRUE; 564 distance_from_root(y) = distance_from_root(y_parent) 565 + link-cost(y_parent,y); 566 if (y_parent == root) 567 /* directly connected */ 568 nexthop_link_at_root(y) = link (root ,y); 569 else 570 nexthop_link_at_root(y) = 571 nexthop_link_at_root(y_parent); 573 /* y is now in PATH. 574 * identify_vertex_as_parent() is defined further 575 * down below, it marks y as parent for its 576 * downstream children, by updating the 577 * map_vertex2parents_heap entry for nodes 578 * immediately downstream of y. Note 579 * that we do not need to iterate over all instances 580 * of y in LL - this can be done by simply examining 581 * all links on y and pulling in those not in PATH. 582 */ 584 identify_vertex_as_parent(y, map_vertex2parents_heap); 585 } 586 } 587 } 589 for each vertex w in graph { 590 if (w is not root) { 591 print (distance_from_root(w), nexthop_link_at_root(w)); 592 } 593 } 594 } 596 identify_vertex_as_parent(vertex y, Mapping map_vertex2parents_heap) 597 { 598 for each link (y,x) that is attached to y { 599 if (x.in_PATH == FALSE) // Ignore nodes already in PATH. 600 { 601 /* 602 * Routine is only called on vertices y that are already 603 * in PATH, so child x will be reachable optimally from y. 604 * 605 * y is a potential parent of x, because x is optimally 606 * reachable from y. But we need to see if there are 607 * other potential parents of x at the same optimal cost, 608 * the policy function will help pick the right parent, 609 * during its invocation in the SPF run for x's selection 610 * into path. 611 * 612 */ 613 if (distance_from_root(x) > (distance_from_root(y) 614 + cost(y,x)) { 615 map_vertex2parents_heap{x}.add_to_heap(x, y, (y,x), 616 distance_from_root(y)+cost(y,x)); 617 } 618 } 619 } 620 } 621 623 5.1 Choice of the policy function. 625 In order to solve the discussed problem, the policy function must 626 preserve parent-child links as they existed in previous stable 627 (non-transient) computations of the tree and it can do this by 628 consulting a a stable snapshot of the previous tree computation, 629 to identify and preserve parent-child relationships in the prior 630 SPF tree. Text further below discusses how a stable snap-shot of 631 the tree may be collected. 633 The policy is applied on a best-effort basis and is consulted 634 during the normal SPF tree construction mechanism if a given child 635 node can be pulled in from a choice of multiple parents, and if one 636 of those parents had pulled in that child node in a previous 637 stable snap-shot of the tree. 639 When used in this manner, if a parent node in the previous 640 tree computation is no longer alive/reachable, it should not be used 641 as a reference in preserving any parent-child relationships. A 642 secondary policy or fall-back tie-break may be needed to identify 643 a parent for a given child in this situation, and this is detailed 644 below. Also, as noted in the code section, in the case where a child 645 node has only one parent, the policy function must unconditionally 646 select that parent node to pull in the child to the SPF tree. The 647 policy function may also take the tree number as an input 648 parameter to determine a fallback tie-break. 650 When the policy function is set up in the manner described above, 651 it helps maintain parent affinity by explicitly breaking ties 652 preferentially so that prior parent relationships are maintained in a 653 new computation of the tree. However, the challenge of synchronizing 654 the application of the policy across rBridges in the network needs to 655 be addressed. In order to do this, the following rules may need to be 656 adhered to when using this approach: 658 a. The first tree computation is carried out using the Trill 659 specified tree construction mechanism. 661 b. So long as the root node for that tree number does not 662 change physically, the second and subsequent tree computations 663 for the same tree can use this algorithm and try to feed in a 664 representation of the previous steady-state tree computation as 665 a guide to the policy function, subject to the following caveats: 667 i. if a non-root node that was a parent in a previous computation 668 is no longer reachable/alive, then the computation should fall 669 back to the default Trill parent selection rule for the set of 670 remaining parent candidates and the associated children. Note 671 that if a sibling of the parent node goes away or changes its 672 link connectivity, it does not impact the concerned parent's 673 child relationships. 675 ii. If a link between a parent node and child node (in the previous 676 computation of that tree) goes down, then that child alone should 677 be pulled in to the SPF tree using vertices that are upstream of 678 it during tree computation, subject to the Trill specified 679 tie-break rule. Note that this may result in some children being 680 pulled in through the old parent and some pulled in through a 681 different node, subject to the Trill tie-break rule. 683 c. If the root node for that tree number changes to a physically 684 different node, then the first tree computation for that tree number 685 with that new tree root should be carried out using the default 686 Trill parent-selection rules. The second and subsequent computations 687 can leverage this approach, each feeding in a representation of the 688 previous stable snap-shot as a reference for the current computation. 690 d. There may be cases during tree construction with this approach 691 where more than one parent finds a match in the representation 692 of the previous tree - in this case the tie should be broken 693 according to the default Trill parent selection rule. This can 694 happen, when a node that was a parent in a previous computation 695 becomes a child node of its former child in the current tree, 696 due to a link going down, for example. In this case, the 697 directional nature of the (parent, child) link in the prior 698 snapshot may need to be ignored while trying to find a match in 699 the prior snapshot. 701 e. The policy function must always be given the representation of the 702 most recent stable snapshot of the prior tree computation for that 703 tree number. Ideally, a stable snap-shot should reflect a converged 704 tree state after all recent network churn events have been absorbed. 705 A timer may be started when any tree computation starts. The timer 706 would need to be long enough to capture a converged tree state. 707 Any interim network events received while the timer is running 708 extend the timer. When the timer expires the tree is deemed to have 709 converged, and a snap-shot of the tree may be collected. 711 f. Alternatively, snap-shot collection may be triggered when the tree 712 routes are being programmed in the RIB, since most implementations 713 would download routes to the RIB only when the tree calculation has 714 completed successfully. An additional stabilization timer may be 715 used, after the RIB trigger, to capture cases where multiple tree 716 computations run back to back. The snapshot must collect all links 717 in the tree, reflecting parent child relationships in the tree. 719 g. To handle situations where a link or node is flapping constantly, 720 nodes in the network should turn off snap-shot matching on the 721 child nodes connected to the affected link or node 722 (if the node happens to be a parent node, then snapshot matching 723 should be turned of for all the children of that node) and fallback 724 to default Trill parent selection rules for the affected children. 725 Nodes in the network may examine parent-child relationships in the 726 successive collected snapshots to see if a specific sub-tree is 727 toggling within a pre-configured interval of time. 729 6. Network wide selection of computation algorithm. 731 Alternative A does not need any operational change to the 732 Trill protocol, beyond the usage of the affinity sub-TLV (which is 733 already in the proposed standard) for the use case identified in 734 this draft. 736 For alternative B, this draft takes no position on how rBridges in 737 the network may negotiate and select the modified algorithm as the 738 preferred tree computation mechanism at this time. 740 7. Relationship to draft-ietf-trill-resilient-trees. 742 Given that both draft-ietf-trill-resilient-trees, and 743 draft-rp-trill-parent-selection-02 drafts use the affinity sub-TLV, 744 it is worthwhile to examine if there is any functional overlap 745 between the two drafts. 747 At a high level, draft-ietf-trill-resilient-trees relates to link 748 protection, and defines the notion of a primary distribution tree 749 and a backup distribution tree (DT), where these trees are 750 intentionally kept link disjoint to the extent possible, and the 751 backup tree is pre-programmed in the hardware, and activated either 752 upfront or upon failure of the primary distribution tree. 754 On the other hand, draft-rp-trill-parent-selection-02 protects 755 parent-child relationships of interest on the primary DT, and has 756 no direct notion of a backup DT. 758 draft-ietf-trill-resilient-trees considers the following algorithmic 759 approaches to the building the backup distribution tree (section 760 numbers listed are from that draft): 762 1. Pure operator configuration for links on the backup DT/manual 763 generation of affinity sub-TLV - this is very tedious and unlikely 764 to scale or be implemented in practice, and hence is disregarded 765 in the analysis here. 767 2. Section 3.2.1.1a: Use of MRT algorithms (which will produce conjugate 768 trees - link disjoint trees with roots for primary and backup trees 769 that are coincident on the same physical rBridge). 771 3. Section 3.2.1.1b: Once the primary DT is constructed, the links 772 used in the primary DT are additively cost re-weighted, and a 773 second SPF is run to derive the links comprising the backup DT. 774 Affinity sub-TLV is used to mark links on the back-up DT which are 775 not also on the primary DT. This approach can handle conjugate 776 trees as well as non-conjugate trees (link disjoint trees that are 777 rooted at different physical nodes). 779 4. Section 3.2.2: A variation on the section 3.2.1.1b approach, but 780 without affinity sub-TLV advertisement. Once the primary DT is 781 constructed, costs for links on the primary DT are multiplied by a 782 fixed multiplier to prevent them from being selected in a 783 subsequent SPF run, unless there is no other choice, and the 784 subsequent SPF yields links on the backup DT. 786 All of the approaches above yield maximally link disjoint trees, 787 when applied as prescribed. 789 Approach 4 above does not seem to use affinity sub-TLVs and instead 790 seems to depend upon a network wide agreement on the alternative 791 tree computation algorithm being used. 793 Approaches 2 and 3 use affinity sub-TLV on the backup DT, for links 794 that are not already on the primary DT. The primary DT does not 795 appear to use affinity sub-TLVs. Additionally, from an end-to-end 796 perspective the backup DT comes into picture when the primary DT 797 fails (this is effectively true even in the 1+1 protection mechanism 798 and in the local protection case), and then again, only until the 799 primary DT is recalculated. Once the primary DT is recalculated, the 800 backup DT is recalculated as well, and can change corresponding to 801 the new primary DT. 803 draft-ietf-trill-resilient-trees cannot directly prevent/mitigate a 804 parent node shift on the primary DT at a given parent node, and while 805 usage of the affinity sub-TLV on the backup DT might confer a parent 806 affinity on some nodes on the backup DT, these are not necessarily 807 the nodes on which the network operator may want/prefer an explicit 808 parent affinity. Further, the backup DT is only used on a transient 809 basis, from a forwarding perspective, until the primary DT is 810 recomputed. 812 In situations involving a node failure, there is no direct functional 813 overlap between draft-ietf-trill-resilient-trees, and the 814 draft-rp-trill-parent-selection-02 draft. The two drafts have different 815 goals and appear to solve unrelated problems, in this situation. 817 Sometimes, a parent shift can be triggered by link failure. In a 818 situation where both drafts are active in the implementation, failure 819 of a specific link may cause the backup DT to kick in, but when the 820 primary DT is re-calculated, draft-rp-trill-parent-selection-02 can be 821 used to preserve parent-child relationships on the primary DT, to the 822 extent possible, during the re-calculation. So, in this case also, 823 there does not appear to be a direct functional overlap in the 824 simultaneous usage of these drafts. 826 8. Security Considerations. 828 The proposal primarily influences tree construction and tries to 829 preserve parent-child relationships in the tree from prior computations 830 of the same tree, without changing any of operational aspects of the 831 protocol. Hence, no new security considerations for Trill are raised 832 by this proposal. 834 9. IANA Considerations. 836 No new registry entries are requested to be assigned by IANA. The 837 Affinity Sub-TLV has been defined in [RFC7176], and this proposal 838 does not change its semantics in any way. 840 10. Informative References. 842 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 843 Requirement Levels", BCP 14, RFC 2119, 844 DOI 10.17487/RFC2119, March 1997, 845 . 847 [RFC6325] Perlman, R., Eastlake 3rd, D., Dutt, D., Gai, S., and A. 848 Ghanwani, "Routing Bridges (RBridges): Base Protocol 849 Specification", RFC 6325, DOI 10.17487/RFC6325, July 2011, 850 . 852 [RFC7780] - Eastlake 3rd, D., Zhang, M., Perlman, R., Banerjee, A., 853 Ghanwani, A., and S. Gupta, "Transparent Interconnection of 854 Lots of Links (TRILL): Clarifications, Corrections, and 855 Updates", RFC 7780, DOI 10.17487/RFC7780, February 2016, 856 . 858 [RFC7783] Senevirathne, T., Pathangi, J., Hudson, J., "Coordinated 859 Multicast Trees (CMT) for Transparent Interconnection of 860 Lots of Links (TRILL)", RFC 7783, February 2016, 861 863 [RFC4971] Vasseur, JP., Shen, N., Aggarwal, R., "Intermediate 864 System to Intermediate System (IS-IS) Extensions for 865 Advertising Router Information", RFC 4971, July 2007, 866 868 Author's Address: 870 R. Parameswaran, 871 Brocade Communications, Inc. 872 120 Holger Way, 873 San Jose, CA 95134. 875 Email: parameswaran.r7@gmail.com 877 Copyright and IPR Provisions 879 Copyright (c) 2017 IETF Trust and the persons identified as the 880 document authors. All rights reserved. 882 This document is subject to BCP 78 and the IETF Trust's Legal 883 Provisions Relating to IETF Documents 884 (http://trustee.ietf.org/license-info) in effect on the date of 885 publication of this document. Please review these documents 886 carefully, as they describe your rights and restrictions with respect 887 to this document. Code Components extracted from this document must 888 include Simplified BSD License text as described in Section 4.e of 889 the Trust Legal Provisions and are provided without warranty as 890 described in the Simplified BSD License. The definitive version of 891 an IETF Document is that published by, or under the auspices of, the 892 IETF. Versions of IETF Documents that are published by third parties, 893 including those that are translated into other languages, should not 894 be considered to be definitive versions of IETF Documents. The 895 definitive version of these Legal Provisions is that published by, or 896 under the auspices of, the IETF. Versions of these Legal Provisions 897 that are published by third parties, including those that are 898 translated into other languages, should not be considered to be 899 definitive versions of these Legal Provisions. For the avoidance of 900 doubt, each Contributor to the IETF Standards Process licenses each 901 Contribution that he or she makes as part of the IETF Standards 902 Process to the IETF Trust pursuant to the provisions of RFC 5378. No 903 language to the contrary, or terms, conditions or rights that differ 904 from or are inconsistent with the rights and licenses granted under 905 RFC 5378, shall have any effect and shall be null and void, whether 906 published or posted by such Contributor, or included with or in such 907 Contribution.