<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std" docName="draft-xie-mboned-bier-entropy-staged-dc-clos-00"
     ipr="trust200902">
  <front>
    <title abbrev="Use of BIER Entropy for DC CLOS Networks">Use of BIER
    Entropy for Data Center CLOS Networks</title>

    <author fullname="Jingrong Xie" initials="J." surname="Xie">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <code/>

          <country/>
        </postal>

        <email>xiejingrong@huawei.com</email>
      </address>
    </author>

    <author fullname="Xiaohu Xu" initials="X." surname="Xu">
      <organization>Alibaba Inc.</organization>

      <address>
        <postal>
          <street/>
        </postal>

        <email>xiaohu.xxh@alibaba-inc.com</email>
      </address>
    </author>

    <author fullname="Gang Yan" initials="G." surname="Yan">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>
        </postal>

        <email>yangang@huawei.com</email>
      </address>
    </author>

    <author fullname="Mike McBride" initials="M." surname="McBride">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <code/>

          <country/>
        </postal>

        <email>mmcbride7@gmail.com</email>
      </address>
    </author>

    <date day="2" month="July" year="2018"/>

    <abstract>
      <t>Bit Index Explicit Replication (BIER) introduces a new
      multicast-specific BIER header. BIER can be applied to the
      Multiprotocol Label Switching (MPLS) data plane or to a non-MPLS data
      plane. Entropy is a technique used in BIER to support load-balancing.
      This document describes how BIER Entropy is to be applied to Data
      Center CLOS networks for path selection.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119"/>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Bit Index Explicit Replication (BIER) <xref target="RFC8279"/> is an
      architecture that provides optimal multicast forwarding without
      requiring intermediate routers to maintain any per-flow state by using a
      multicast-specific BIER header. <xref target="RFC8296"/> defines two
      types of BIER encapsulation formats: one is MPLS encapsulation, the
      other is non-MPLS encapsulation. Entropy is a technique used in BIER to
      support load-balancing. This document examines and describes how BIER
      Entropy is to be applied to Data Center CLOS networks for path
      selection.</t>
    </section>

    <section title="Terminology">
      <t>Readers of this document are assumed to be familiar with the
      terminology and concepts of the documents listed as Normative
      References.</t>
    </section>

    <section title="Problem Statement and Considerations">
      <t/>

      <section title="Problem Statement">
        <t>A common choice for a horizontally scalable topology in data
        centers is the Clos topology. This topology features an odd number of
        stages; the 5-stage Clos topology described in <xref
        target="RFC7938"/> is one example.</t>

        <t>ECMP is the fundamental load-sharing mechanism used by a Clos
        topology. Effectively, every lower-tier device will use all of its
        directly attached upper-tier devices to load-share traffic destined to
        the same IP prefix. The number of ECMP paths between any two Tier 3
        devices in Clos topology is equal to the number of the devices in the
        middle stage (Tier 1). For example, Figure 1 illustrates a topology
        where Tier 3 device L1 has four paths to reach servers X and Y, via
        Tier 2 devices S1 and S2 and then Tier 1 devices S11, S12, S21, and
        S22, respectively.</t>

        <t><figure align="left" anchor="IPv6-Dest-Option-BIER"
            title="5-Stage Clos Topology">
            <artwork><![CDATA[      
                                      Tier 1
                                     +-----+
          Cluster                    |SUPER|
 +----------------------------+   +--| S11 |--+
 |                            |   |  +-----+  |
 |                    Tier 2  |   |           |   Tier 2
 |                   +-----+  |   |  +-----+  |  +-----+
 |     +-------------|SPINE|------+--|SUPER|--+--|SPINE|-------------+
 |     |       +-----|  S1 |------+  | S12 |  +--|  S3 |-----+       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     |       |              |                              |       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     | +-----------|SPINE|------+  |SUPER|  +--|SPINE|-----------+ |
 |     | |     | +---|  S2 |------+--| S21 |--+--|  S4 |---+ |     | |
 |     | |     | |   +-----+  |   |  +-----+  |  +-----+   | |     | |
 |     | |     | |            |   |           |            | |     | |
 |   +-----+ +-----+          |   |  +-----+  |          +-----+ +-----+
 |   | LEAF| | LEAF|          |   +--|SUPER|--+          | LEAF| | LEAF|
 |   |  L1 | |  L2 | Tier 3   |      | S22 |      Tier 3 |  L3 | |  L4 |
 |   +-----+ +-----+          |      +-----+             +-----+ +-----+
 |     | |     | |            |                            | |     | |
 |     O O     O O            |                            X Y     O O
 |       Servers              |                              Servers
 +----------------------------+
      ]]></artwork>
          </figure></t>
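
        <t>The path count stated above can be checked with a short Python
        sketch. The adjacency tables are an assumption of this sketch, read
        off Figure 1, and the names UPLINKS, DOWNLINKS and paths are
        illustrative only; they do not come from any referenced
        specification.</t>

        <t><figure>
            <artwork><![CDATA[
```python
# Links as read from Figure 1 (an assumption of this sketch).
# Upstream links: leaf to spine, spine to super-spine.
UPLINKS = {
    "L1": ["S1", "S2"],
    "S1": ["S11", "S12"],
    "S2": ["S21", "S22"],
}
# Downstream links: super-spine to spine, spine to leaf.
DOWNLINKS = {
    "S11": ["S3"],
    "S12": ["S3"],
    "S21": ["S4"],
    "S22": ["S4"],
    "S3": ["L3", "L4"],
    "S4": ["L3", "L4"],
}


def paths(src, dst):
    """List every up-then-down path from leaf src to leaf dst."""
    graph = {**UPLINKS, **DOWNLINKS}
    found = []
    stack = [[src]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == dst:
            found.append(path)
            continue
        # Nodes without an entry (other leaves) are dead ends.
        for nxt in graph.get(node, []):
            stack.append(path + [nxt])
    return sorted(found)


for p in paths("L1", "L3"):
    print(" -> ".join(p))
```
]]></artwork>
          </figure></t>

        <t>Under these assumptions the sketch finds four paths from L1 to
        L3, one through each Tier 1 device.</t>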

        <t>When BIER is deployed in a multi-tenant data center network
        environment for efficient delivery of Broadcast, Unknown-unicast and
        Multicast (BUM) traffic, a network operator may want a deterministic
        path for every packet. For example, when L1 needs to send a BUM packet
        to L3 and L4, which belong to different Set Identifiers (SIs), L1 has
        to send the packet twice, and expects the two copies to follow the two
        deterministic paths L1-&gt;S1-&gt;S11--&gt;L3 and
        L1-&gt;S2-&gt;S21--&gt;L4, respectively. Another example of using a
        deterministic path in a DC is per-flow steering of "elephant" flows,
        as described in <xref
        target="I-D.ietf-spring-segment-routing-msdc"/>.</t>

        <t>A deterministic path for multicast traffic across multiple stages
        of equal cost paths is comparable to the traffic-engineering path
        defined in <xref target="I-D.ietf-mpls-spring-entropy-label"/> for
        unicast traffic across multiple hops of equal cost paths.</t>

        <t/>
      </section>

      <section title="Considerations">
        <t/>

        <t>The idea behind entropy is that the ingress router computes a hash
        based on several fields of a given packet and places the result in
        an additional label, named the "entropy label". This entropy label
        can then be used as part of the hash keys used by a transit router.
        When an entropy label is used, the keys used in the hashing functions
        are still a local configuration matter. A router may use the entropy
        label alone or in combination with other fields from the incoming
        packet. The purpose of the hashing function is to spread a large
        number of flows pseudo-randomly over a small number of equal cost
        paths.</t>
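
        <t>The hashing behaviour described above might be sketched in Python
        as follows. Everything specific here is an assumption made for
        illustration: CRC-32 stands in for whatever hash an implementation
        uses, the five flow fields are a hypothetical key set, and the
        transit router is shown using the entropy value alone with a simple
        modulo.</t>

        <t><figure>
            <artwork><![CDATA[
```python
import zlib

ENTROPY_BITS = 20  # width of the BIER entropy field


def ingress_entropy(src_ip, dst_ip, src_port, dst_port, proto):
    """Hash flow fields into a 20-bit value (hypothetical key set)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}"
    # CRC-32 is only a stand-in hash for this sketch.
    return zlib.crc32(key.encode()) & ((1 << ENTROPY_BITS) - 1)


def transit_pick(entropy, num_paths):
    """Pick one of num_paths equal cost paths from the entropy."""
    return entropy % num_paths


e = ingress_entropy("10.0.0.1", "10.0.1.9", 40000, 4789, 17)
print(e, transit_pick(e, 4))
```
]]></artwork>
          </figure></t>

        <t>Packets of the same flow always produce the same entropy and hence
        the same path, but the operator has no direct control over which path
        that is.</t>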

        <t>If, however, one wants to select a deterministic path from among
        the equal cost paths, one can use part of the 20-bit entropy field
        directly. For example, bit 0 to bit 2 of the entropy label can
        represent a value of 0 to 7, and thus can be used to select a
        deterministic path from 8 equal cost paths. Routers in different
        tiers can therefore each select a path independently by using
        different parts of the 20-bit entropy label, and together they form
        an end-to-end deterministic path.</t>
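
        <t>Reading a part of the entropy label amounts to a shift and a mask.
        The Python sketch below assumes that bit 0 is the least significant
        bit of the 20-bit field; the function name entropy_part and the
        sample value are hypothetical.</t>

        <t><figure>
            <artwork><![CDATA[
```python
ENTROPY_MASK = (1 << 20) - 1  # 20-bit entropy field


def entropy_part(entropy, low_bit, high_bit):
    """Return bits low_bit..high_bit of entropy (bit 0 = LSB)."""
    width = high_bit - low_bit + 1
    mask = (1 << width) - 1
    return ((entropy & ENTROPY_MASK) >> low_bit) & mask


entropy = 0b110101
print(entropy_part(entropy, 0, 2))  # bits 0-2 -> 5
print(entropy_part(entropy, 3, 5))  # bits 3-5 -> 6
```
]]></artwork>
          </figure></t>

        <t>With this sample value, a router configured to use bit 0 to bit 2
        would select path 5 of its 8 equal cost paths, while a router in
        another tier configured to use bit 3 to bit 5 would select path
        6.</t>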

        <t>This is simple and applicable especially for DC CLOS networks,
        because data delivery in DC CLOS networks for tenants is always
        multi-staged, with the upstream direction stages having equal cost
        paths.</t>

        <t/>
      </section>
    </section>

    <section title="Use of BIER Entropy for DC CLOS Network">
      <t/>

      <section title="Use of BIER Entropy for DC CLOS Network">
        <t/>

        <t>Take the 5-stage CLOS network in Figure 1 as an example.</t>

        <t>Tier 2 in every cluster has N nodes, and Tier 1 has M nodes. M
        is equal to N multiplied by P.</t>

        <t>Tier 3 switches, in the upstream direction, act as stage 1 of data
        delivery and have N equal cost paths to every BFER in other clusters.
        Tier 2 switches, in the upstream direction, act as stage 2 of data
        delivery and have P equal cost paths to every BFER in other
        clusters.</t>

        <t>Example 1: One can configure, on each Tier 3 switch, the use of bit
        0 for path selection when N is equal to 2, and configure, on each Tier
        2 switch, to use bit 1 for path selection when P is equal to 2.</t>

        <t>Example 2: One can configure, on each Tier 3 switch, the use of bit
        0 to bit 1 for path selection when N is equal to 4, and configure, on
        each Tier 2 switch, the use of bit 2 to bit 7 for path selection when
        P is equal to 48.</t>

        <t>Assume that each Tier 3 and Tier 2 switch in the example has two
        parameters, X and Y, that determine which part of the entropy label it
        uses for path selection. Then, in Example 2:</t>

        <t><list style="symbols">
            <t>Each Tier 3 (Stage 1) switch has a pair of parameters
            (X1=1, Y1=4).</t>

            <t>Each Tier 2 (Stage 2) switch has a pair of parameters
            (X2=X1*Y1=4, Y2=64).</t>

            <t>Each Tier 3 (Stage 1) switch populates its BIFTs for ECMP,
            for example, BIFT-0 to BIFT-3.</t>

            <t>Each Tier 2 (Stage 2) switch populates its BIFTs for ECMP,
            for example, BIFT-0 to BIFT-47.</t>
          </list></t>
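
        <t>One plausible reading of the parameters listed above is that X is
        the weight of the lowest bit a stage uses and Y is the number of
        values its part of the entropy label can hold, so that a switch
        selects BIFT number (Entropy / X) mod Y. The Python sketch below
        follows that reading, which is an assumption rather than something
        this document states. This document also does not say what happens
        when the selected value has no BIFT (for example, 48 to 63 at Tier
        2); the hypothetical function select_bift simply rejects such
        values.</t>

        <t><figure>
            <artwork><![CDATA[
```python
def select_bift(entropy, x, y, num_bifts):
    """Return the BIFT index chosen by this stage's entropy part.

    Assumed reading: x is the weight of the lowest bit used by the
    stage, y is the number of values the part can hold.
    """
    index = (entropy // x) % y
    if index >= num_bifts:
        # Unused values are not specified; reject them.
        raise ValueError(f"entropy part {index} has no BIFT")
    return index


# Example 2 parameters: Tier 3 (X1=1, Y1=4), Tier 2 (X2=4, Y2=64).
entropy = 42
print("Tier 3 uses BIFT-%d" % select_bift(entropy, 1, 4, 4))
print("Tier 2 uses BIFT-%d" % select_bift(entropy, 4, 64, 48))
```
]]></artwork>
          </figure></t>

        <t>For the sample entropy value 42, this reading selects BIFT-2 on a
        Tier 3 switch and BIFT-10 on a Tier 2 switch.</t>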

        <t>On each Tier 3 (Stage 1) switch, each BIFT has a preferred
        neighboring BFR. For example, LEAF L1 has the preferred neighbors S1
        and S2 for BIFT-0 and BIFT-1, respectively. When the BIFT-0 table is
        built from the underlay routes to every BFER, the preferred
        neighboring BFR has the highest priority among all the locally
        available ECMP paths.</t>

        <t>An end-to-end deterministic path for a BIER packet can then be
        obtained by calculating an entropy label value as follows:</t>

        <t><list style="symbols">
            <t>Entropy = (P1-1)*X1 + (P2-1)*X2</t>
          </list></t>

        <t>Where P1 represents one of the Stage 1 equal cost paths with a
        value between 1 and N, and P2 represents one of the Stage 2 equal cost
        paths with a value between 1 and P.</t>
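
        <t>The formula can be exercised with the Example 2 parameters (N=4,
        P=48, X1=1, X2=4). The Python sketch below is illustrative only; the
        function name entropy_for_path and the chosen path numbers are
        hypothetical.</t>

        <t><figure>
            <artwork><![CDATA[
```python
def entropy_for_path(p1, p2, n, p, x1, x2):
    """Entropy = (P1-1)*X1 + (P2-1)*X2, with 1-based path numbers."""
    if not 1 <= p1 <= n:
        raise ValueError("P1 must be between 1 and N")
    if not 1 <= p2 <= p:
        raise ValueError("P2 must be between 1 and P")
    return (p1 - 1) * x1 + (p2 - 1) * x2


# Example 2: third Stage 1 path, eleventh Stage 2 path.
value = entropy_for_path(3, 11, n=4, p=48, x1=1, x2=4)
print(value)  # (3-1)*1 + (11-1)*4 = 42
```
]]></artwork>
          </figure></t>

        <t>The largest value under these parameters, for P1=4 and P2=48, is
        3 + 47*4 = 191, which fits comfortably in the 20-bit entropy
        field.</t>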

        <t/>
      </section>

      <section title="Steering for elephant flows">
        <t/>

        <t>One can steer an "elephant" flow onto a single end-to-end
        deterministic path, or divide it over several end-to-end
        deterministic paths across different SIs.</t>

        <t/>
      </section>

      <section title="Path Division for Tenant flows to different SIs">
        <t/>

        <t>When the VNEs for a tenant span multiple SIs, it is useful to
        divide the paths of the BUM packets across the different SIs.</t>

        <t>When BIER is used as the BUM tunnel, one can configure a policy, on
        each VNE and for each VNI, to use different paths for different BIER
        SIs.</t>

        <t/>
      </section>

      <section title="Link Failure and Convergence">
        <t/>

        <t>As stated above, each BIFT on a BFR has a preferred neighboring
        BFR. When the link to the preferred neighbor of some BIFT (say
        BIFT-X) fails, BIFT-X converges normally, but afterwards it will
        probably no longer provide the 'best' path. For example, if the link
        between S1 and L2 fails, then S1, the preferred neighbor of BIFT-0 on
        LEAF L1, is no longer the neighboring BFR towards LEAF L2. A flow
        whose entropy selects LEAF L1's BIFT-0 then has to be replicated on
        L1: one packet is sent to S1 for BFERs L3 and L4, and one packet is
        sent to S2 for BFER L2. If the flow changes to an entropy that selects
        LEAF L1's BIFT-1, it follows the 'best' path again, because the packet
        no longer has to be replicated on L1; a single copy is sent to S2 for
        BFERs L2, L3 and L4. Changing a flow's entropy in this way is the
        responsibility of the ingress switch, possibly with the assistance of
        a controller.</t>
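
        <t>The choice between BIFTs after a failure can be illustrated with a
        small Python sketch. It assumes, as stated earlier in this document,
        that S1 and S2 are the preferred neighbors of LEAF L1's BIFT-0 and
        BIFT-1, and that after the S1-L2 link fails S1 can still reach L3 and
        L4 but not L2. The tables and the functions copies_sent and best_bift
        are hypothetical; how an ingress switch or controller actually learns
        this state is outside the sketch.</t>

        <t><figure>
            <artwork><![CDATA[
```python
def copies_sent(preferred, neighbors, reachable, bfers):
    """Group BFERs by the next hop L1 uses to reach them."""
    # Try the BIFT's preferred neighbor first, then the others.
    order = [preferred] + [n for n in neighbors if n != preferred]
    copies = {}
    for bfer in sorted(bfers):
        hop = next(n for n in order if bfer in reachable[n])
        copies.setdefault(hop, set()).add(bfer)
    return copies


# State on LEAF L1 after the S1-L2 link has failed (assumed).
NEIGHBORS = ["S1", "S2"]
PREFERRED = {0: "S1", 1: "S2"}  # BIFT index -> preferred neighbor
REACHABLE = {"S1": {"L3", "L4"}, "S2": {"L2", "L3", "L4"}}
BFERS = {"L2", "L3", "L4"}


def best_bift():
    """Pick the BIFT whose use needs the fewest copies on L1."""
    def cost(bift):
        return len(copies_sent(PREFERRED[bift], NEIGHBORS,
                               REACHABLE, BFERS))
    return min(PREFERRED, key=cost)


print(copies_sent("S1", NEIGHBORS, REACHABLE, BFERS))
print(copies_sent("S2", NEIGHBORS, REACHABLE, BFERS))
print(best_bift())
```
]]></artwork>
          </figure></t>

        <t>Under these assumptions BIFT-0 needs two copies on L1 (one via S1
        and one via S2), whereas BIFT-1 needs a single copy via S2, so the
        sketch selects BIFT-1.</t>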

        <t/>
      </section>
    </section>

    <section title="Data-Plane Processing">
      <t/>

      <t>The use of the BIER entropy label to select one path from a set of
      equal cost paths is a local configuration matter. This document defines
      a method in which each router uses part of the 20-bit entropy label,
      which requires the data plane to perform some bit operations. These are
      expected to be simpler than a hashing function.</t>

      <t/>
    </section>

    <section title="Security Considerations">
      <t>This document introduces no new security considerations beyond those
      already specified in <xref target="RFC8279"/> and <xref
      target="RFC8296"/>.</t>

      <t/>
    </section>

    <section title="IANA Considerations">
      <t>This document contains no actions for IANA.</t>

      <t/>
    </section>

    <section title="Acknowledgements">
      <t>TBD.</t>

      <t/>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.8279'?>

      <?rfc include='reference.RFC.8296'?>

      <?rfc include='reference.RFC.7938'?>

      <?rfc include='reference.RFC.8365'?>

      <?rfc include='reference.I-D.ietf-mpls-spring-entropy-label'?>

      <?rfc include='reference.I-D.ietf-spring-segment-routing-msdc'?>
    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.2119'?>
    </references>
  </back>
</rfc>
