COIN                                                           H. Singh
Internet-Draft                                  MNK Labs and Consulting
Intended status: Informational                         December 8, 2020
Expires: June 11, 2021

 Requirements for P4 Program Splitting for Heterogeneous Network Nodes
                   draft-hsingh-coinrg-reqs-p4comp-02

Abstract

   The P4 research community has published a paper that shows how to
   split a P4 program into sub-programs that run on heterogeneous
   network nodes in a network.  Examples of such nodes are a network
   switch, a smartNIC, and a host machine.  The paper develops
   artifacts to split a program based on latency, data rate, cost, and
   similar criteria.  However, the paper does not state any
   requirements.  To provide guidance, this document covers
   requirements for splitting P4 programs for heterogeneous network
   nodes.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 11, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Table of Contents

   1.  Requirements Language
   2.  Introduction
   3.  Requirements
   4.  Changes to P4 Compiler to Block Split
   5.  Discussion
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Author's Address

1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Introduction

   The research paper [FLY] covers splitting a P4 program into sub-
   programs so that the sub-programs run on heterogeneous network
   nodes.  Several issues need to be discussed first, because some P4
   code cannot be split to run elsewhere; other issues arise as well.
   For brevity, this document uses the terms smartNIC and NIC
   interchangeably.

   In a data center, host machines are connected to a switch.  In an
   enterprise network, the P4 data plane replicates ARP [RFC0826] and
   IPv6 Neighbor Discovery (ND) [RFC4861] messages for layer-2 address
   resolution.  If a program split moves the ARP and IPv6 ND code to
   the smartNIC, the hosts should also move to the smartNIC.  If the
   hosts do not move, the switch resolves layer-2 destinations and
   messages the NIC with an ARP or IPv6 ND table update.  But the
   switch is forwarding traffic at 12 Tbps, and for every layer-2
   lookup the switch has to message the NIC, which slows down switch
   forwarding.  If the hosts move to the NIC along with ARP and IPv6
   ND, there are still issues.  A NIC with two 100G ports cannot
   support all of the 25G hosts on a switch with 32 ports, so multiple
   NICs are used.  If a switch is used in bridged mode, there is a
   single link-local domain for ARP and IPv6 ND.  If the switch is
   used as a layer-3 switch, a single interface with layer-3 addresses
   can serve the switch.  With multiple NICs, each NIC has its own
   link-local domain and, if configured, its own layer-3 interface.
   Hosts on one NIC therefore go through an additional router to
   communicate with hosts on another NIC.  On the switch running in
   bridged mode, that router is not needed.

   In the public cloud, Azure resolves layer-2 destinations with a
   central controller, so the switch does not use any data plane
   broadcast or IPv6 ND multicast addresses.  However, this network
   faces the same issue described above when multiple NICs are used.
   Google resolves layer-2 addresses via a proprietary Neighbor
   Discovery protocol [GOOG].  How does Flightplan [FLY] deal with
   three such disparate networks?

   Regarding BGP, if a Clos network (a redundant leaf-and-spine
   topology) runs BGP, BGP operates between the leaf and spine
   switches.  If the BGP data plane table is split to a smartNIC, an
   IP address has to be assigned to the BGP peer on the host CPU.  The
   host CPU then runs the BGP control plane while the NIC stores the
   BGP data plane tables.  However, neither Azure nor AWS (Amazon Web
   Services) runs any SDN or BGP control plane on the host, because
   such network activity steals key cycles from the host CPU.  There
   is another major problem.  Hosts routinely move in the data center
   for load balancing.  With a host move, the BGP peer may move to an
   entirely different subnet and break the BGP network.

   The punt or divert path of a data plane processes ARP, IPv6 ND, and
   any routing control messages.  Production-quality switches (or
   routers) also run a punt rate-limiter in the data plane so that the
   switch/router CPU is not inundated.  In a heterogeneous network,
   the question is not just how close to the CPU packets are punted,
   but also what else moves with the punt path.  Certainly the data
   plane punt rate-limiter also moves.
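   As a minimal illustration only (not taken from [FLY]), the
   following P4_16 fragment sketches such a punt path for the v1model
   architecture.  The CPU_PORT value, the headers_t/metadata_t types,
   and the field names are assumptions of this sketch and are not
   defined by this document or by [FLY]; the point is that the punt
   classifier and its rate-limiting meter form one unit, so whichever
   node the punt path moves to must also carry the rate limiter.

      // Sketch only: assumes core.p4 and v1model.p4 are included and
      // that headers_t/metadata_t carry the fields referenced below.
      const bit<9>  CPU_PORT     = 255;   // assumed CPU (punt) port
      const bit<16> ETHTYPE_ARP  = 0x0806;
      const bit<16> ETHTYPE_IPV6 = 0x86DD;
      const bit<8>  PROTO_ICMPV6 = 58;    // IPv6 ND rides over ICMPv6

      control PuntPath(inout headers_t hdr,
                       inout metadata_t meta,
                       inout standard_metadata_t std_meta) {

          direct_meter<bit<2>>(MeterType.packets) punt_meter;

          action punt_to_cpu() {
              punt_meter.read(meta.punt_color); // rate-limit punt path
              std_meta.egress_spec = CPU_PORT;  // divert to control CPU
          }

          table punt_classifier {
              // Entries such as (ETHTYPE_ARP, don't-care) or
              // (ETHTYPE_IPV6, PROTO_ICMPV6) are installed by the
              // control plane to select ARP and IPv6 ND messages.
              key = {
                  hdr.ethernet.etherType : exact;
                  hdr.ipv6.nextHdr       : ternary;
              }
              actions = { punt_to_cpu; NoAction; }
              meters = punt_meter;       // v1model direct meter binding
              default_action = NoAction();
          }

          apply {
              punt_classifier.apply();
              // Assuming the v1model color encoding (0 green, 1 yellow,
              // 2 red): drop punted packets that exceed the punt budget.
              if (meta.punt_color == 2) {
                  mark_to_drop(std_meta);
              }
          }
      }

   If a program split moves the punt classifier off the switch, a
   fragment of this shape, including the meter, has to move with it.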
3.  Requirements

   The requirements are:

   1.  If the heterogeneous network includes a switch, the ARP and
       IPv6 ND data plane P4 code should not be split to run outside
       the switch.

   2.  Likewise, ARP or IPv6 ND Proxy data plane code should not be
       split to run outside the switch.

   3.  The BGP table should not be split and moved outside the switch.
       Distributed BGP is a research topic.

   4.  A switch likely includes TCAM (ternary content-addressable
       memory), and thus the P4 program may use the P4 ternary table
       match kind.  If such a table is moved to another node by a
       program split, the node the code moves to matters.  An FPGA
       (field-programmable gate array) does not use TCAM, and a host
       machine may not either; the FPGA and the host use hash-based
       table lookup.  Depending on the table key size, an appropriate
       hash is required.  Either the splitting tool prompts the user
       for which hash to use or it deduces the hash; user input is
       desirable.  For example, a 6-tuple IPv4 key uses 128 bits,
       while the same 6-tuple IPv6 key uses 320 bits.  Appropriate
       hashes are required for such keys.

   5.  Splitting algorithms should not develop their own High
       Availability mechanisms.  Network deployments already use dual
       switches or a Clos topology for redundancy.  BFD [RFC5880] is
       recommended for liveness detection.

   6.  Any automated tool that splits a P4 program to run on
       heterogeneous nodes should provide a manual override.  For
       example, a P4 program is compiled for a switching ASIC, and the
       compiler raises an error saying the code needs N+2 pipeline
       stages but the ASIC has only N stages.  In this case, an
       automated tool would simply split the program.  A manual
       override, however, allows the programmer to tweak the code by
       hand to make it fit.  With manual tweaking, the author has been
       able to fit code into N-1 stages after an initial compiler
       error for code using N+2 stages.  The manual override could
       kick in if the number of stages used is (N + 16% x N), i.e.,
       1.16 x N.

   7.  The splitting tool should clearly define the punt path for P4
       code running on a host.  Because the host CPU is the data
       plane, where is a packet punted to the CPU actually sent?  For
       DPDK, the author expects Linux user space to receive punted
       packets.  VPP (Vector Packet Processing) supports a punt node.

4.  Changes to P4 Compiler to Block Split

   Using P4 annotations to tell the p4c (P4 compiler) backend [P4C]
   not to split certain code is not desirable.  Instead, this document
   proposes to change p4c: a new table implementation property called
   no-split is added to p4c.  If the no-split property is configured
   for a table in the P4 program, then the table, its actions, and any
   code block that invokes the table are not split.
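   As an illustration only, a table carrying the proposed property
   might look as follows.  The property name, its boolean value
   syntax, and the table and actions shown are assumptions of this
   sketch; they are not part of today's P4_16 grammar or of the p4c
   reference compiler [P4C].

      // Hypothetical sketch of the proposed no-split table property.
      // reply_na and punt_to_cpu are assumed actions, not defined here.
      table nd_proxy {
          key = {
              hdr.ipv6.dstAddr : ternary; // TCAM-style match on the ASIC
          }
          actions = { reply_na; punt_to_cpu; NoAction; }
          size = 4096;
          default_action = NoAction();
          no_split = true;  // proposed: keep this table, its actions,
                            // and its apply() call sites together
      }

   If a splitting tool were allowed to move such a table to an FPGA or
   a host, the ternary key would also have to be rewritten as an
   exact-match key backed by a hash wide enough for the key width
   (per requirement 4, e.g., 320 bits for an IPv6 6-tuple); the
   no-split property keeps such tables on the switch instead.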
5.  Discussion

   The two largest public cloud operators are Amazon AWS and Microsoft
   Azure [NIC].  Both operators run Software-Defined Networking (SDN)
   in the smartNIC (smart Network Interface Card).  The reason is that
   running the SDN stack in software on the host requires additional
   CPU cycles; burning CPUs for SDN services takes away from the
   processing power available to customer VMs and increases the
   overall cost of providing cloud services.  Azure uses an FPGA on
   the smartNIC and programs the FPGA in Verilog, not P4.  Amazon uses
   a multi-core NPU (Graviton uses 64 cores) on the smartNIC and does
   not program the Graviton in P4.  Neither operator uses the host CPU
   or the network switch for SDN operations.  Even if both operators
   program the smartNIC in P4 in the future, they still will not have
   heterogeneous nodes running SDN, because SDN would run only on the
   NIC.  However, if in the future the switch runs a new SDN feature,
   e.g., switch caching of popular lookups, then there will be
   heterogeneous nodes to which Flightplan can be applied.

6.  Security Considerations

   Use IPsec [RFC4301] to secure any control plane communications.

7.  IANA Considerations

   None.

8.  Acknowledgements

   Thanks to Nik Sultana for reviewing this document.

9.  References

9.1.  Normative References

   [RFC0826]  Plummer, D., "An Ethernet Address Resolution Protocol:
              Or Converting Network Protocol Addresses to 48.bit
              Ethernet Address for Transmission on Ethernet Hardware",
              STD 37, RFC 826, DOI 10.17487/RFC0826, November 1982,
              <https://www.rfc-editor.org/info/rfc826>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC4301]  Kent, S. and K. Seo, "Security Architecture for the
              Internet Protocol", RFC 4301, DOI 10.17487/RFC4301,
              December 2005, <https://www.rfc-editor.org/info/rfc4301>.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
              "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
              DOI 10.17487/RFC4861, September 2007,
              <https://www.rfc-editor.org/info/rfc4861>.

   [RFC5880]  Katz, D. and D. Ward, "Bidirectional Forwarding
              Detection (BFD)", RFC 5880, DOI 10.17487/RFC5880,
              June 2010, <https://www.rfc-editor.org/info/rfc5880>.

9.2.  Informative References

   [FLY]      Sultana, N., Sonchack, J., Giesen, H., Pedisich, I.,
              Han, Z., Shyamkumar, N., Burad, S., DeHon, A., and
              B. T. Loo, "Flightplan: Dataplane Disaggregation and
              Placement for P4 Programs", November 2020.

   [GOOG]     Singh, A., "Jupiter Rising: A Decade of Clos Topologies
              and Centralized Control in Google's Datacenter Network",
              September 2016.

   [NIC]      Firestone, D., "Azure Accelerated Networking: SmartNICs
              in the Public Cloud", April 2018.

   [P4C]      P4 Community, "P4_16 Reference Compiler (p4c) - GitHub",
              May 2018.

Author's Address

   Hemant Singh
   MNK Labs and Consulting
   7 Caldwell Drive
   Westford, MA  01886
   USA

   Email: hemant@mnkcg.com
   URI:   https://mnkcg.com/