Gateway Auto-Discovery and Route Advertisement for Segment Routing Enabled Site Interconnection

Data centers (DCs) are critical components of the infrastructure used by network operators to provide services to their customers. DCs are attached to the Internet or a backbone network by gateway routers (GWs). One DC typically has more than one GW for various reasons including commercial preferences, load balancing, or resiliency against connection or device failure. Segment Routing (SR) is a protocol mechanism that can be used within a DC, and also for steering traffic that flows between two DC sites. In order for a source DC (also known as an ingress DC) that uses SR to load balance the flows it sends to a destination DC (also known as an egress DC), it needs to know the complete set of entry nodes (i.e., GWs) for that egress DC from the backbone network connecting the two DCs. Note that it is assumed that the connected set of DCs and the backbone network connecting them are part of the same SR BGP Link State (LS) instance ( and ) so that traffic engineering using SR may be used for these flows.

Other sites, such as access networks, also need to be connected across backbone networks through gateways. For illustrative purposes, consider the ingress and egress sites shown in as separate ASes (noting that the sites could be implemented as part of the ASes to which they are attached, or as separate ASes). The various ASes that provide connectivity between the ingress and egress sites could each be constructed differently and use different technologies such as IP, MPLS with global table routing native BGP to the edge, MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN. That is, the ingress and egress sites can be connected by tunnels across a variety of technologies. This document describes how SR identifiers (SIDs) are used to identify the paths between the ingress and egress sites.

The solution described in this document is agnostic as to whether the transit ASes do or do not have SR capabilities. However, the solution uses SR to stitch together path segments between GWs and through the ASBRs. Thus, there is a requirement that the GWs and ASBRs are SR-capable. The solution supports the SR path being extended into the ingress and egress sites if they are SR-capable.

The solution defined in this document can be seen in the broader context of site interconnection in . That document shows how other existing protocol elements may be combined with the solution defined in this document to provide a full system, but is not a necessary reference for understanding this document.

Suppose that there are two gateways, GW1 and GW2 as shown in , for a given egress site and that they each advertise a route to prefix X which is located within the egress site with each setting itself as next hop. One might think that the GWs for X could be inferred from the routes' next hop fields, but typically it is not the case that both routes get distributed across the backbone: rather only the best route, as selected by BGP, is distributed. This precludes load balancing flows across both GWs.

The obvious solution to this problem is to use the BGP feature that allows the advertisement of multiple paths in BGP (known as Add-Paths) to ensure that all routes to X get advertised by BGP. However, even if this is done, the identity of the GWs will be lost as soon as the routes get distributed through an Autonomous System Border Router (ASBR) that will set itself to be the next hop. And if there are multiple Autonomous Systems (ASes) in the backbone, not only will the next hop change several times, but the Add-Paths technique will experience scaling issues. This all means that the Add-Paths approach is limited to sites connected over a single AS.

This document defines a solution that overcomes this limitation and works equally well with a backbone constructed from one or more ASes using the Tunnel Encapsulation attribute as follows:

   When a GW to a given site advertises a route to a prefix X within that site, it will include a Tunnel Encapsulation attribute that contains the union of the Tunnel Encapsulation attributes advertised by each of the GWs to that site, including itself.

In other words, each route advertised by a GW identifies all of the GWs to the same site (see for a discussion of how GWs discover each other). I.e., the Tunnel Encapsulation attribute advertised by each GW contains multiple Tunnel TLVs, one or more from each active GW, and each Tunnel TLV will contain a Tunnel Egress Endpoint Sub-TLV that identifies the GW for that Tunnel TLV. Therefore, even if only one of the routes is distributed to other ASes, it will not matter how many times the next hop changes, as the Tunnel Encapsulation attribute will remain unchanged.

To put this in the context of , GW1 and GW2 discover each other as gateways for the egress site. Both GW1 and GW2 advertise themselves as having routes to prefix X. Furthermore, GW1 includes a Tunnel Encapsulation attribute which is the union of its Tunnel Encapsulation attribute and GW2's Tunnel Encapsulation attribute. Similarly, GW2 includes a Tunnel Encapsulation attribute which is the union of its Tunnel Encapsulation attribute and GW1's Tunnel Encapsulation attribute. The gateway in the ingress site can now see all possible paths to X in the egress site regardless of which route is propagated to it, and it can choose one, or balance traffic flows as it sees fit.
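
As an informal illustration of the union described above, the following Python sketch models a Tunnel TLV as a small record and builds the attribute that each GW would attach to its routes. It is not a wire-format encoder: the TunnelTLV class, the union_attribute function, the "sr" tunnel-type label, and the documentation addresses are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TunnelTLV:
    """Hypothetical, simplified stand-in for one Tunnel TLV."""
    tunnel_type: str      # illustrative label, not a registered code point
    egress_endpoint: str  # Tunnel Egress Endpoint sub-TLV: the GW's address


def union_attribute(tlvs_by_gw):
    """Merge the Tunnel TLVs of every active GW into one attribute.

    tlvs_by_gw maps a GW name to the list of Tunnel TLVs that GW advertises.
    GWs are visited in sorted name order so that every GW computing the
    union over the same inputs arrives at the same result.
    """
    seen = set()
    merged = []
    for gw in sorted(tlvs_by_gw):
        for tlv in tlvs_by_gw[gw]:
            if tlv not in seen:  # drop exact duplicates
                seen.add(tlv)
                merged.append(tlv)
    return merged


# GW1 and GW2 both serve the egress site that contains prefix X.
gw1_tlvs = [TunnelTLV("sr", "192.0.2.1")]
gw2_tlvs = [TunnelTLV("sr", "192.0.2.2")]

# Either GW attaches the same union to its route to X.
advertised = union_attribute({"GW1": gw1_tlvs, "GW2": gw2_tlvs})
endpoints = [tlv.egress_endpoint for tlv in advertised]
```

Because the union is carried in the attribute rather than in the next hop, a receiver in this sketch sees both endpoints no matter which GW's route survives best-path selection.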

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

To allow a given site's GWs to auto-discover each other and to coordinate their operations, the following procedures are implemented:

*  Each GW is configured with an identifier for the site. That identifier MUST be the same across all GWs to the site (i.e., the same identifier is used by all GWs to the same site), and MUST be unique across all sites that are connected (i.e., across all GWs to all sites that are interconnected).

*  A route target () MUST be attached to each GW's auto-discovery route (defined below) and its value MUST be set to the site identifier.

*  Each GW MUST construct an import filtering rule to import any route that carries a route target with the same site identifier that the GW itself uses. This means that only these GWs will import those routes, and that all GWs to the same site will import each other's routes and will learn (auto-discover) the current set of active GWs for the site.

The auto-discovery route that each GW advertises consists of the following:

*  An IPv4 or IPv6 Network Layer Reachability Information (NLRI) containing one of the GW's loopback addresses (that is, with an AFI/SAFI pair that is one of IPv4/NLRI used for unicast forwarding (1/1), IPv6/NLRI used for unicast forwarding (2/1), IPv4/NLRI with MPLS Labels (1/4), or IPv6/NLRI with MPLS Labels (2/4)).

*  A Tunnel Encapsulation attribute containing the GW's encapsulation information encoded in one or more Tunnel TLVs.

To avoid the side effect of applying the Tunnel Encapsulation attribute to any packet that is addressed to the GW itself, the GW MUST use a different loopback address for packets intended for it.

As described in , each GW will include a Tunnel Encapsulation attribute with the GW encapsulation information for each of the site's active GWs (including itself) in every route advertised externally to that site. As the current set of active GWs changes (due to the addition of a new GW or the failure/removal of an existing GW), each externally advertised route will be re-advertised with a new Tunnel Encapsulation attribute which reflects the current set of active GWs.

If a gateway becomes disconnected from the backbone network, or if the site operator decides to terminate the gateway's activity, it MUST withdraw the advertisements described above. This means that remote gateways at other sites will stop seeing advertisements from this gateway.

Note that if the routing within a site is broken (for example, such that there is a route from one GW to another, but not in the reverse direction), then it is possible that incoming traffic will be routed to the wrong GW to reach the destination prefix; in this degraded network situation, traffic may be dropped.

Note that if a GW is (mis)configured with a different site identifier from the other GWs to the same site, then it will not be auto-discovered by the other GWs (and will not auto-discover the other GWs). This would result in a GW for another site receiving only the Tunnel Encapsulation attribute included in the BGP best route; i.e., the Tunnel Encapsulation attribute of the (mis)configured GW or that of the other GWs.
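
The auto-discovery procedure can be pictured with a small Python model. This is a sketch under stated assumptions, not a BGP implementation: the AutoDiscoveryRoute and Gateway classes, the string site identifiers, and the opaque "tlv-..." placeholders are all hypothetical names introduced for illustration. It shows the import filter keyed on the route target, the resulting set of active GWs, and the change in the externally advertised attribute when a peer GW withdraws.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoDiscoveryRoute:
    loopback: str       # NLRI: one of the advertising GW's loopback addresses
    route_target: str   # set to the site identifier
    tunnel_tlvs: tuple  # opaque placeholders for the GW's Tunnel TLVs


class Gateway:
    """Hypothetical model of one GW's view of its site's active GWs."""

    def __init__(self, site_id, own_route):
        self.site_id = site_id
        # A GW always counts itself among the active GWs.
        self.active = {own_route.loopback: own_route.tunnel_tlvs}

    def receive(self, route):
        """Import filter: accept only routes carrying our own site identifier."""
        if route.route_target != self.site_id:
            return False
        self.active[route.loopback] = route.tunnel_tlvs
        return True

    def withdraw(self, loopback):
        """A peer GW withdrew its auto-discovery route."""
        self.active.pop(loopback, None)

    def external_attribute(self):
        """Tunnel TLVs to attach to every externally advertised route."""
        return [tlv for lb in sorted(self.active) for tlv in self.active[lb]]


gw1_route = AutoDiscoveryRoute("192.0.2.1", "site-A", ("tlv-gw1",))
gw2_route = AutoDiscoveryRoute("192.0.2.2", "site-A", ("tlv-gw2",))
other_route = AutoDiscoveryRoute("198.51.100.1", "site-B", ("tlv-gw9",))

gw1 = Gateway("site-A", gw1_route)
imported_gw2 = gw1.receive(gw2_route)      # same site: imported
imported_other = gw1.receive(other_route)  # different site: filtered out
before = gw1.external_attribute()
gw1.withdraw("192.0.2.2")                  # GW2 fails or is removed
after = gw1.external_attribute()
```

In this model, a GW configured with the wrong site identifier would simply never appear in the other GWs' active sets, which matches the misconfiguration case noted above.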

When a remote GW receives a route to a prefix X, it uses the Tunnel Egress Endpoint Sub-TLVs in the containing Tunnel Encapsulation attribute to identify the GWs through which X can be reached. It uses this information to compute SR Traffic Engineering (SR TE) paths across the backbone network looking at the information advertised to it in SR BGP Link State (BGP-LS) and correlated using the site identity. SR Egress Peer Engineering (EPE) can be used to supplement the information advertised in BGP-LS.
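
A minimal sketch of the receiving side follows, assuming Tunnel TLVs have already been decoded into dictionaries with an "egress_endpoint" key (a hypothetical representation, as are the function names and flow-key format). The hash-based choice is only one possible load-balancing policy, used here for illustration; this document does not mandate how a remote GW selects among the egress GWs.

```python
import hashlib


def candidate_gateways(tunnel_tlvs):
    """Return the distinct egress endpoints found in the Tunnel TLVs,
    in order of first appearance."""
    return list(dict.fromkeys(tlv["egress_endpoint"] for tlv in tunnel_tlvs))


def pick_gateway(flow_key, gateways):
    """Map a flow to one GW with a stable hash (illustrative policy only)."""
    digest = hashlib.sha256(flow_key.encode()).digest()
    return gateways[int.from_bytes(digest[:4], "big") % len(gateways)]


# Attribute received with a route to prefix X (decoded form is assumed).
attribute = [
    {"egress_endpoint": "192.0.2.1"},
    {"egress_endpoint": "192.0.2.2"},
    {"egress_endpoint": "192.0.2.1"},  # a second Tunnel TLV from the same GW
]

gateways = candidate_gateways(attribute)
chosen = pick_gateway("10.0.0.5:443->203.0.113.7:55000", gateways)
```

Once an egress GW is chosen for a flow, the SR TE path toward it would be computed from the BGP-LS information as described above; that computation is outside the scope of this sketch.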

When a packet destined for prefix X is sent on an SR TE path to a GW for the site containing X (that is, the packet is sent in the ingress site on an SR TE path that describes the whole path including those parts that are within the egress site), it needs to carry the receiving GW's SID for X such that this SID rises to the top of the stack before the GW completes its processing of the packet. To achieve this, each Tunnel TLV in the Tunnel Encapsulation attribute contains a Prefix SID sub-TLV for X. As defined in [RFC9012], the Prefix SID sub-TLV is only for IPv4/IPv6 labelled unicast routes, so the solution described in this document only applies to routes of those types. If the use of the Prefix SID sub-TLV for routes of other types is defined in the future, further documents will be needed to describe their use.

Alternatively, if MPLS SR is in use and if the GWs for a given site are configured to allow remote GWs to perform SR TE through that site for a prefix X, then each GW computes an SR TE path through that site to X from each of the currently active GWs, and places each in an MPLS label stack sub-TLV in the SR Tunnel TLV for that GW. Please refer to Section 7 of for worked examples of how the SID stack is constructed in this case, and how the advertisements would work.
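
The requirement that the GW's SID for X "rises to the top of the stack" can be illustrated with a toy model of a SID list. This is a sketch only: the SID values, the three-segment backbone path, and the helper names are invented for the example, and real forwarding behavior (for instance, penultimate hop popping) is deliberately ignored.

```python
def build_sid_stack(path_sids, gw_sid_for_x):
    """Top of stack is index 0: the path segments toward the egress GW come
    first, and the GW's SID for prefix X sits underneath them."""
    return list(path_sids) + [gw_sid_for_x]


def after_segments_completed(stack, completed):
    """Return what is left of the stack once `completed` segments are done."""
    return list(stack[completed:])


# Hypothetical SIDs: two ASBR segments, then the egress GW's node segment.
path_to_gw1 = [16001, 16002, 16010]
gw1_sid_for_x = 17001  # value taken from the Prefix SID sub-TLV (assumed)

stack = build_sid_stack(path_to_gw1, gw1_sid_for_x)
at_gw1 = after_segments_completed(stack, len(path_to_gw1))
```

In this model, when the packet has consumed every segment on the path to GW1, the only SID remaining is GW1's SID for X, which is what GW1 needs in order to forward the packet on toward X.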

If the GWs for a given site are configured to allow remote GWs to send them a packet in that site's native encapsulation, then each GW will also include multiple instances of a Tunnel TLV for that native encapsulation in externally advertised routes: one for each GW and each containing a Tunnel Egress Endpoint sub-TLV with that GW's address. A remote GW may then encapsulate a packet according to the rules defined via the sub-TLVs included in each of the Tunnel TLVs.

IANA maintains a registry called "Border Gateway Protocol (BGP) Parameters" with a sub-registry called "BGP Tunnel Encapsulation Attribute Tunnel Types." The registration policy for this registry is First-Come First-Served . IANA previously assigned the value 17 from this sub-registry for "SR Tunnel", referencing this document. IANA is now requested to mark that assignment as deprecated. IANA may reclaim that codepoint at such a time that the registry is depleted.

From a protocol point of view, the mechanisms described in this document can leverage the security mechanisms already defined for BGP. Further discussion of security considerations for BGP may be found in the BGP specification itself and in the security analysis for BGP . The original discussion of the use of the TCP MD5 signature option to protect BGP sessions is found in , while includes an analysis of BGP keying and authentication issues.

The mechanisms described in this document involve sharing routing or reachability information between sites: that may mean disclosing information that is normally contained within a site. So it needs to be understood that normal security paradigms based on the boundaries of sites are weakened and interception of BGP messages may result in information being disclosed to third parties. Discussion of these issues with respect to VPNs can be found in , while describes many of the issues associated with the exchange of topology or TE information between sites.

Particular exposures resulting from this work include:

*  Gateways to a site will know about all other gateways to the same site. This feature applies within a site and so is not a substantial exposure, but it does mean that if the BGP exchanges within a site can be snooped or if a gateway can be subverted then an attacker may learn the full set of gateways to a site. This would facilitate more effective attacks on that site.

*  The existence of multiple gateways to a site becomes more visible across the backbone and even into remote sites. This means that an attacker is able to prepare a more comprehensive attack than exists when only the locally attached backbone network (e.g., the AS that hosts the site) can see all of the gateways to a site. For example, a Denial of Service attack on a single GW is mitigated by the existence of other GWs, but if the attacker knows about all the gateways then the whole set can be attacked at once.

*  A node in a site that does not have external BGP peering (i.e., is not really a site gateway and cannot speak BGP into the backbone network) may be able to get itself advertised as a gateway by letting other genuine gateways discover it (by speaking BGP to them within the site) and so may get those genuine gateways to advertise it as a gateway into the backbone network. This would allow the malicious node to attract traffic without having to have secure BGP peerings with out-of-site nodes.

*  An external party intercepting BGP messages anywhere between sites may learn information about the functioning of the sites and the locations of end points. While this is not necessarily a significant security or privacy risk, it is possible that the disclosure of this information could be used by an attacker.

*  If it is possible to modify a BGP message within the backbone, it may be possible to spoof the existence of a gateway. This could cause traffic to be attracted to a specific node and might result in black-holing of traffic.

All of the issues in the list above could cause disruption to site interconnection, but are not new protocol vulnerabilities so much as new exposures of information that SHOULD be protected against using existing protocol mechanisms such as securing the TCP sessions over which the BGP messages flow. Furthermore, it is a general observation that if these attacks are possible then it is highly likely that far more significant attacks can be made on the routing system. It should be noted that BGP peerings are not discovered, but always arise from explicit configuration.

The principal configuration item added by this solution is the allocation of a site identifier. The same identifier MUST be assigned to every GW to the same site, and each site MUST have a different identifier. This requires coordination, probably through a central management agent. It should be noted that BGP peerings are not discovered, but always arise from explicit configuration. This is no different from any other BGP operation.
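
A central management agent could check the two site-identifier rules with logic along the following lines. This is a hedged sketch: the configuration format (a mapping from GW name to a site name and site identifier), the function name, and the identifier values are assumptions made for illustration, not part of this specification.

```python
from collections import defaultdict


def check_site_identifiers(config):
    """Report violations of the two site-identifier rules.

    config maps a GW name to a (site_name, site_identifier) pair.
    Returns a list of human-readable problem descriptions (empty if none).
    """
    ids_by_site = defaultdict(set)
    sites_by_id = defaultdict(set)
    for site, ident in config.values():
        ids_by_site[site].add(ident)
        sites_by_id[ident].add(site)

    problems = []
    # Rule 1: every GW to the same site uses the same identifier.
    for site in sorted(ids_by_site):
        if len(ids_by_site[site]) > 1:
            problems.append(
                f"site {site} has several identifiers: {sorted(ids_by_site[site])}")
    # Rule 2: each site has a different identifier.
    for ident in sorted(sites_by_id):
        if len(sites_by_id[ident]) > 1:
            problems.append(
                f"identifier {ident} is shared by sites: {sorted(sites_by_id[ident])}")
    return problems


good = {"GW1": ("egress", "65000:1"), "GW2": ("egress", "65000:1"),
        "GW3": ("ingress", "65000:2")}
bad = {"GW1": ("egress", "65000:1"), "GW2": ("egress", "65000:9"),
       "GW3": ("ingress", "65000:1")}

good_problems = check_site_identifiers(good)
bad_problems = check_site_identifiers(bad)
```

In the second configuration, GW2 breaks the first rule (a mismatched identifier within the egress site) and GW3 breaks the second (an identifier reused across sites), so two problems are reported.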

In order to limit the VPN routing information that is maintained at a given route reflector, suggests the use of "Cooperative Route Filtering" between route reflectors. RFC 4684 defines an extension to that mechanism to include support for multiple autonomous systems and asymmetric VPN topologies such as hub-and-spoke. The mechanism in RFC 4684 is known as Route Target Constraint (RTC). An operator would not normally configure RTC by default for any AFI/SAFI combination, and would only enable it after careful consideration. When using the mechanisms defined in this document, the operator should consider carefully the effects of filtering routes. In some cases this may be desirable, and in others it could limit the effectiveness of the procedures.

Thanks to Bruno Rijsman, Stephane Litkowski, Boris Hassanov, Linda Dunbar, Ravi Singh, and Daniel Migault for review comments, and to Robert Raszuk for useful discussions. Gyan Mishra provided a helpful GenArt review, and John Scudder made helpful comments during IESG review.