Network Working Group                                       Danny Cohen
Internet Draft                                                  Myricom
Expires in six months                                        Craig Lund
                                                      Mercury Computers
                                                          Tony Skjellum
                                           Mississippi State University
                                                          Robert George
                                           Mississippi State University
                                                           Thom McMahon
                                           Mississippi State University
                                                               May 1998

              The End-to-End (EEP) PacketWay Protocol for
          High-Performance Interconnection of Computer Clusters


Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its
   areas, and its working groups. Note that other groups may also
   distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as
   "work in progress."

   To view the entire list of current Internet-Drafts, please check
   the "1id-abstracts.txt" listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
   (Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au
   (Pacific Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu
   (US West Coast).

Table of Contents:

   1.  Introduction
   1a. PktWay and IP
   1b. General
   1c. The Level-2 Operation of PktWay
   2.  A note about the PktWay documents
   3.  Notations
   4.  PktWay EEP Messages
   4a. The PktWay Message Structure
   4b. The Optional Fields
   5.  Optional Sequence of L2RHs and Symbols
   5a. L2 Routing Headers (L2RHs)
   5b. Symbols
   6.  EEP Header
   6a. Version
   6b. Priority
   6c. Destination-Type
   6d. Packet Type Extension
   6e. Packet Type
   6f. Endianness
   6g. Padding Length
   6h. Data Length
   6i. Options flag
   6j. Reserved
   6k. Source Address
   7.  Optional Header Fields
   8.  Optional Data Block
   9.  Optional Trailer Fields
   10. EEP Trailer
   11. Appendix-A: Recommendation for PktWay Address Assignment
   12. Appendix-B: Glossary
   13. Appendix-C: Acronyms and Abbreviations
   14. Appendix-D: PktWay at a Glance ("cheat-sheet")
   15. Security Considerations
   16. Editor's Address

Cohen et al                                                    [Page 1]

1. Introduction

   PktWay is an open family of specifications for inter-networking
   high-performance SANs (System Area Networks) and high-performance
   LANs (Local Area Networks) into computing clusters.

   Most modern SANs have much in common, such as high data rates, low
   message latency, and low bit error rates. Such SANs are often
   packet networks made of point-to-point links with flow control and
   source routing. Yet these SANs do not provide heterogeneous
   networking support, and are consequently incapable of direct
   inter-communication with other SANs. PktWay's goal is to provide
   high-performance "internetting" of such SANs and of
   high-performance LANs.

   The core PktWay protocol comprises the End-to-End Protocol (EEP)
   and the Router-to-Router Protocol (RRP). This document specifies
   the EEP of PktWay. A companion document ("Specification for the
   Router-to-Router (RRP) PktWay Protocol") specifies the RRP.

   Computing clusters and modern MPPs (Massively Parallel Processing
   systems) are sets of processors interconnected by high-performance
   SANs. Examples are Intel's Paragon and ASCI-red, CRAY's T3D and
   T3E, and IBM's SP2 and SP3.
   Unfortunately, there is no efficient way to "internet" these SANs,
   that is, to allow each computing node to communicate with high
   performance directly with any other computing node in any other
   interconnected SAN. Hence, there is no way to interconnect such
   high-performance SANs into computing clusters that are as efficient
   as possible.

   The objective of PktWay is to provide high-performance
   communication among all the processors in a cluster of tightly
   coupled heterogeneous SANs. PktWay borrows heavily from the
   experience and wisdom of IP, with a few modifications needed for
   high performance. PktWay sacrifices generality and scalability to
   improve performance.

1a. PktWay and IP

   IP is the general solution for "internetting" heterogeneous,
   diverse networks, proven for over 25 years. However, IP was
   designed for the generality required by Wide Area Networks,
   without regard to the high-performance requirements of tightly
   coupled systems. In addition, IP was designed to address "systems"
   rather than individual processors in MPPs (as PktWay does). For
   example, a 9,000-processor system is not expected to be assigned
   9,000 IP addresses.

   PktWay is slightly below IP in the OSI Reference Model. It has
   many Level-3 features, like IP, but it can also support IP as if
   PktWay were a Level-2 protocol; hence, it is below IP. In
   addition, PktWay supports Level-2 optimizations (such as source
   routing).

   Like IP, PktWay is a heterogeneous network layer: its packets are
   transported by the native data-link layer of each SAN. As a
   result, PktWay packets are encapsulated with whatever native
   routing headers and trailers the local network fabric requires.

   Like IP, PktWay uses routers between its SANs. When an HR
   (half-router) receives a packet for a destination on its own
   SAN, it forwards that packet directly to its destination.
   If the packet is for a destination outside this SAN, the HR
   forwards it to another HR that is en route to that destination.

   Unlike IP, RRP defines the communication among the HRs, both
   intra-router and inter-router. In the IP environment the
   intra-router communication is not defined; only the inter-router
   communication is.

   Unlike IP, the PktWay routers do not have to pop each packet back
   to Level-3, and are capable of operating entirely at Level-2 if
   this operation is requested by the communicating hosts. This
   Level-2 operation is discussed later in this introduction.

   Like IP, the PktWay protocol utilizes the native capabilities of
   its constituent SANs and routers. PktWay defines neither how each
   HR maps the network of the SAN to which it is attached, nor how
   each half-router constructs SAN-headers for each of its hosts.
   The PktWay protocol also does not define how error-checking is
   conducted by each SAN (e.g., CRC8, CRC32, CRC64, or anything
   else). Instead, PktWay assumes that these capabilities are native
   to each SAN, and defines only how these maps are exchanged and
   how error indications are carried from where they were detected
   to the destination node.

   Like IP, the PktWay protocol defines neither how routes are
   selected, nor what corrective actions should be taken in case of
   faults. Instead, PktWay provides the information that host nodes
   need for devising routes and for detecting and circumventing
   faults.

   As in IP environments, when hosts are powered up they may contact
   their default half-routers to register themselves and to inquire
   about other hosts (by name or node capabilities). This
   registration could be used in support of dynamic discovery
   procedures.
   The half-routers may help nodes discover each other
   (like IP's DNS) and may provide routing alternatives, possibly
   with different characteristics (e.g., MTU, length, and cost).
   The PktWay protocol does not specify how to choose among them.

   In summary, PktWay has learned many lessons from IP but has been
   heavily optimized for high-performance SANs, whereas IP remains
   the protocol of choice for WANs.

1b. General

   PktWay supports resource discovery, by name or capabilities.

   PktWay's unit of data is 64 bits (8 bytes) long. Hence, the length
   of a PktWay packet is always a multiple of 8B. PktWay provides
   hosts with padding as required.

   PktWay itself is big-Endian and 8B-word based. Hence, the terms
   "first bit" and "first byte" are equivalent to MSbit and MSByte.

   PktWay handles the Little- vs. Big-Endian issue for its payload by
   providing a field in the EEP header that defines the endianness
   and "the chunk-size" of the data in the payload (Data Block). The
   intent is that byte-swapping hardware, if any, could be used to
   invert the endianness of payloads with uniform data elements
   (e.g., all the data being 32-bit floating point). Although this
   approach does not address the problems of transporting general
   structures (e.g., a "struct" of C), it does allow smart memory
   cards to participate as PktWay nodes, and it supports direct
   memory access (DMA) operations.

   The PktWay protocol is designed to allow wormhole (or
   "cut-through") forwarding, in which a router can start forwarding
   a packet after receiving only its first four bytes (which include
   the PktWay-protocol version, priority, and the destination-type),
   without waiting for information that may not be needed for the
   packet forwarding task.
   This is unlike IP routers, which receive the
   sender address before the destination address, even though the
   former is not always needed whereas the latter is.

   PktWay's addresses are short (23 bits) because, unlike IP, PktWay
   is not designed for global operation. The amount of state that is
   stored in the half-routers per node (type, name, paths,
   capabilities, etc.) makes it impractical to scale beyond a few
   tens (hundreds?) of thousands of nodes, over a (relatively) small
   number of SANs.

   PktWay does not support SAR (Segmentation And Reassembly).
   Instead, it provides means for hosts to discover the maximum
   transmission unit (MTU) over several alternative paths to any
   other node. A PktWay packet must never exceed the smallest MTU
   along all the network hops from the source node to the
   destination node.

   Several protocol extensions, which are layered on the core PktWay
   protocol, have been defined. These include dynamic resource and
   routing discovery, secure PktWay, and multicast PktWay. These
   protocol extensions will be described in documents to be provided
   later.

1c. The Level-2 Operation of PktWay

   PktWay's goal is to move data from a source node (on some
   arbitrary SAN) to a destination node (either on the same SAN or
   on another SAN). Sources and destinations can be physical
   entities, such as a processor or a smart memory board, or logical
   entities, such as a group of cooperating processes or a collection
   of threads. Sources, destinations, and routers are all such nodes.

   Within each PktWay configuration all nodes have unique 23-bit
   physical PktWay addresses. A system designer can assign these
   PktWay addresses manually. Alternatively, the optional PktWay
   Server Layer may provide a way to assign and discover addresses
   dynamically.
   Throughout this document "address" always means the
   23-bit physical PktWay address.

   To optimize for performance, PktWay has a data transfer mode that
   directly leverages the native message routing schemes used within
   each SAN. This mode uses a "Planned Transfer" paradigm. During
   the planning phase, a source node collects information on optimal
   routes to a destination, expressed in the various native formats
   of all the intervening SANs. The source node later uses this
   information for low-latency transfers to that destination. In
   PktWay, the transfer phase of a Planned Transfer is called
   "L2-forwarding". The RRP document demonstrates the use of
   L2-forwarding.

   PktWay also supports a more traditional data transfer mode that
   requires no planning. Such transfers specify the destinations by
   their addresses only. In PktWay, this more traditional approach
   is called "L3-forwarding".

   PktWay packets may be routed by Level-2 (L2) forwarding, Level-3
   (L3) forwarding, or a combination thereof.

   In L3-forwarding (similar to IP forwarding), the L2-routing
   through each SAN is determined by an inter-SAN router upon
   entering that SAN. The router prefixes the packet with an L2
   routing header (such as a source route) corresponding to the
   destination address specified in the packet, directing the packet
   either to its destination or to an intermediate router. It is the
   task of that router to determine the L2-routing-header
   corresponding to the given PktWay address.

   In L2-forwarding the source prefixes the packet with all the
   L2-routing headers needed along the entire path to the
   destination. Each router only has to take the L2-routing-header
   from the leading L2RH (L2-Routing-Header record) that was provided
   by the source.
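   The C sketch below illustrates what one HR might do with the
   leading records during L2-forwarding: skip any symbols and take the
   first L2RH, leaving the rest of the packet to be forwarded. It
   relies on the record formats of Section 5. The symbol size follows
   Section 5b1. The L2RH length encoding is an ASSUMPTION inferred
   from the 13-byte example in Section 5a: the routing-byte count L is
   taken to be the 6 LSbits of the second byte. The function names are
   hypothetical, and a real HR would act on each symbol rather than
   merely skip it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of one leading record, or 0 if b starts the EEP header.
 * b must point to at least 8 readable bytes.
 * Symbol (9th-12th bits 0b1111): L is the 5th byte; 8*[(L+12)/8] bytes.
 * L2RH (9th-10th bits 0b10): ASSUMPTION based on the Section 5a example,
 * L is the 6 LSbits of the 2nd byte; 2+L bytes padded to 8B-words. */
static size_t pw_leading_record_size(const uint8_t *b)
{
    if ((b[1] & 0xF0) == 0xF0)
        return 8u * (((size_t)b[4] + 12u) / 8u);
    if ((b[1] & 0xC0) == 0x80)
        return 8u * (((size_t)(b[1] & 0x3F) + 9u) / 8u);
    return 0;
}

/* One L2-forwarding step at a hypothetical HR: skip leading symbols and
 * take the first L2RH. Returns 1 and sets *l2rh if an L2RH was consumed,
 * 0 if the EEP header was reached first (no L2RH left), -1 if the packet
 * is truncated. *consumed is the number of leading bytes removed. */
int pw_l2_consume(const uint8_t *pkt, size_t len,
                  size_t *consumed, const uint8_t **l2rh)
{
    size_t off = 0;

    assert(pkt != NULL && consumed != NULL && l2rh != NULL);
    *consumed = 0;
    *l2rh = NULL;
    while (len - off >= 8) {
        const uint8_t *rec = pkt + off;
        size_t size = pw_leading_record_size(rec);

        if (size == 0) {                 /* EEP header reached */
            *consumed = off;
            return 0;
        }
        if (size > len - off)            /* record runs past the buffer */
            return -1;
        off += size;
        if ((rec[1] & 0xC0) == 0x80) {   /* that record was the L2RH */
            *consumed = off;
            *l2rh = rec;
            return 1;
        }
        /* otherwise it was a symbol: a real HR would process it here */
    }
    return -1;
}
```

   For example, a packet that starts with the 9-byte symbol of Section
   5b2 followed by the 13-byte L2RH of Section 5a would lose its first
   32 bytes (two 8B-words for each record) at this HR.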
   PktWay allows hosts to construct a source-route built entirely of
   Level-2 headers, allowing each SAN to exploit the full performance
   of its native interconnection fabric. These SAN-headers
   (equivalent to MAC-headers) are provided by the SANs that will use
   them, in their native format. PktWay does not define the format
   of the local routing envelope. Instead, it defines how the
   encapsulated PktWay packets should be passed between half-routers,
   leaving it up to the local network of each SAN to properly deliver
   the packet.

   If hosts so prefer, they can address their destinations by any
   arbitrary name, by a PktWay physical address (which is handled
   like the Level-3 IP address), or by a concatenated sequence of
   Level-2 SAN-headers. Although a sequence of L2 Routing Headers
   requires more effort to construct initially, PktWay source routing
   results in considerably lower network latencies, because the
   packets are allowed to cut-through route through the intervening
   SAN networks.

2. A note about the PacketWay Documents

   The PacketWay protocol is defined by a series of documents:

      * EEP (End-to-End Protocol)
      * RRP-1 (basic Router-to-Router Protocol)
      * RRP-2 (dynamic inter-SAN routing)
      * PktWay enumerations

   Each of these documents should include the same "PacketWay at a
   Glance (Cheat-Sheet)", this note, and the Notations page. They
   should also include (as appendices) a copy of the PacketWay
   glossary of terms and its list of acronyms and abbreviations.

   The EEP and the RRP documents will be published first as
   Internet-Drafts and later as Proposed-Standards, Draft-Standards,
   and Standards.

   The Enumeration Document will first be published as an
   "Informational-RFC" and will later be maintained by IANA.

   The enumeration document may be attached to the EEP/RRP documents,
   as a matter of convenience.
   The enumeration is NOT a part of the PktWay
   standard, just as RFC0739 (the original "Assigned Numbers" RFC) is
   not a part of RFC0791, which defines IP.

   Similarly, the EEP document has "Appendix-A: A Recommendation for
   PktWay Address Assignment", which is a recommendation only and NOT
   a part of the PktWay standard, just as IP address assignment is
   not a part of RFC0791.

   The appendices are included for clarity and convenience. They are
   not a part of the PktWay specification.

   Information about the PktWay activity may be found at the URL:
   http://www.erc.msstate.edu/PktWay/

3. Notations

   The shorter "PktWay" is used for "PacketWay".

   8B means "8-byte" (64 bits).

   0x indicates hexadecimal values, e.g., 0x0100 is 2^8=256 (decimal).

   0b indicates binary values, e.g., 0b0100 is 4 (decimal).

   xxxx indicates a field that is discarded without any checking
   (e.g., padding).

   [fff] indicates that fff is an optional field, when appropriate.

   [exp], in equations, is the integral part, rounded down, of "exp",
   e.g., [23/8]=2.

   No length field includes itself; therefore, lengths may be zero.

   Lengths are specified either (a) by byte count, implying that some
   padding bytes may follow to fill 8B-words, or (b) by 8B-word count
   and PL, the number of trailing padding bytes (with PL between 0
   and 7).

4. PktWay EEP Messages

4a. The PktWay Message Structure

   PktWay messages have 6 components, 4 of which are optional:

      [1]: [Optional Sequence of L2-Routing-Headers and Symbols]
      [2]: EEP Header (16 bytes)                            (PH)
      [3]: [Optional Header fields]                         (OH)
      [4]: [Optional, but most likely present: Data Block]  (DB)
      [5]: [Optional Trailer fields]                        (OT)
      [6]: EEP Trailer (8 bytes)                            (TAIL)

4b. The Optional Fields

   [1]: As explained later, if the 9th and 10th bits of a message are
        0b10, the message starts with an L2RH, but if the 9th through
        the 12th bits of a message are 0b1111, the message starts with
        a "symbol". The other values of these 4 bits indicate the
        absence of L2RHs and symbols, and that the message begins
        with the EEP header.

   [3]: If the h-bit in the EEP header [2] is 1, there are optional
        header (OH) fields. The sequence of these OH fields is
        terminated by an OH field marked as the last one (with C=1).

   [4]: If DL, in the EEP header, is greater than zero, a Data Block
        (DB) is included in this message.

   [5]: The optional header fields [3] may indicate that some
        optional trailer fields are present after the DB [4]. The
        order and the formats of the trailer fields are defined by
        the optional header fields.

   It is expected that most messages will have Data Blocks (DB), and
   that most messages will have neither Optional Header fields (OH)
   nor Optional Trailer fields (OT).

   Leading L2RHs and symbols [1] are consumed by the HRs before the
   message reaches the destination, which receives only the other
   components, [2] through [6]. These parts, [2] to [6], constitute
   the End-to-End Protocol of PktWay.

   TAIL, the EEP trailer [6], may be modified along the way to the
   destination, unlike [2], [3], [4], and [5], which arrive exactly
   as sent by the source.

   Each PktWay packet may first be L2-forwarded (zero or more times)
   before being L3-forwarded (zero or more times).

   Although PktWay headers and trailers are always in Big-Endian
   order, the byte order of the Data Block is not defined by PktWay.

   Since all the elements of PktWay (L2RHs, EEP headers, optional
   fields, data, and EEP trailers) are always multiples of 8B-words,
   it is recommended that PktWay headers (and data) be aligned on
   8B-boundaries in the nodes' memory.
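   Because every element occupies whole 8B-words, converting between
   a byte count and the (8B-word count, PL) form of Section 3 is plain
   integer arithmetic. A minimal C sketch follows; the type and
   function names are illustrative only and are not part of PktWay.

```c
#include <assert.h>
#include <stdint.h>

/* A length in form (b) of Section 3: 8B-word count plus PL. */
typedef struct {
    uint32_t words;  /* number of 8B-words */
    uint8_t  pl;     /* trailing padding bytes, 0..7 */
} pw_len;

/* Round a byte count up to whole 8B-words, recording the padding. */
pw_len pw_len_from_bytes(uint32_t bytes)
{
    pw_len r;
    r.words = (uint32_t)(((uint64_t)bytes + 7u) / 8u);
    r.pl = (uint8_t)(8u * (uint64_t)r.words - bytes);
    return r;
}

/* Inverse: the net byte count is 8*words - PL. */
uint64_t pw_bytes_from_len(pw_len l)
{
    assert(l.pl <= 7 && (l.words > 0 || l.pl == 0));
    return 8u * (uint64_t)l.words - l.pl;
}
```

   For instance, 13 data bytes occupy 2 8B-words with PL=3, and
   8*2-3 recovers the 13 bytes.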
5. Optional Sequence of L2RHs and Symbols [1]

   PktWay messages may start with a mix of L2RHs and symbols.

   A PktWay source may specify native routes by placing them before
   the PktWay Header. The native routes (for all SANs and LANs beyond
   the initial one) must appear within a sequence of PktWay
   L2-Routing-Header records (L2RH).

   In certain situations symbols may be included among the L2RHs.
   These symbols are used for conveying information, such as about
   encryption, to the routers that handle the messages. A symbol does
   not specify its destination and is processed (and consumed) by
   the entity that encounters it.

   In L2-forwarding each intermediate HR consumes an L2RH and the
   preceding symbols (if any). When a packet reaches its destination,
   all of [1] (the Optional Sequence of L2RHs and Symbols) should
   have been consumed.

5a. L2 Routing Headers (L2RHs) Records

   The contents of the L2RH are totally SAN dependent, with the
   exception of the first 2 bytes, which distinguish this record from
   an EEP header and also provide the Length (0|

   An L2RH with an SR with 13 routing bytes:

      0b10     L=13      #1       #2       #3       #4       #5       #6
   +--------+--------+--------+--------+--------+--------+--------+--------+
   |00000000|10001101|  SR01  |  SR02  |  SR03  |  SR04  |  SR05  |  SR06  |
   +--------+--------+--------+--------+--------+--------+--------+--------+
   |  SR07  |  SR08  |  SR09  |  SR10  |  SR11  |  SR12  |  SR13  |  xxxx  |
   +--------+--------+--------+--------+--------+--------+--------+--------+
       #7       #8       #9      #10      #11      #12      #13    padding

5b. Symbol Records

5b1. Symbol Format:

   Each symbol is in the format:

   +--------+--------+--------+--------+--------+--------+--------+--------+
   |vv000000|1111ssss|ssssssss|ssssssss| Length |  data  |........|........|
   +--------+--------+--------+--------+--------+--------+--------+--------+
             ^^^^<---- Symbol-Type --->

   The 5th byte is the byte-count (L) of the data for this symbol.
   The data starts in the next byte and is followed by as many
   padding bytes as needed to fill 8B-words.

   The length (L) does not include itself; hence it is between 0 and
   255. A symbol occupies 5+L bytes before padding (4 bytes carrying
   the 0b1111 marker and the Symbol-Type, 1 length byte, and L data
   bytes), so the total number of 8B-words in the symbol is
   [(L+12)/8], where the square brackets indicate the integer part,
   rounded down, of the quantity within. Therefore, the number of
   padding bytes is PL=8*[(L+12)/8]-5-L. For example, L=9 gives
   2 8B-words with 2 padding bytes.

   Symbols may be mixed among the L2RHs, before the EEP header.

   The values of the Symbol-Type field are defined in the PktWay
   Enumeration document.

5b2. Symbol Example:

   A symbol with 9 data bytes:

           0b1111<---- Symbol Type ---> L=9 bytes
   +--------+--------+--------+--------+--------+--------+--------+--------+
   |vv000000|1111ssss|ssssssss|ssssssss|00001001| data1  | data2  | data3  |
   +--------+--------+--------+--------+--------+--------+--------+--------+
   | data4  | data5  | data6  | data7  | data8  | data9  |  xxxx  |  xxxx  |
   +--------+--------+--------+--------+--------+--------+--------+--------+

6. EEP Header [2]

   The EEP header (aka PH) has 16 bytes.

    2   6                24                    16                16
   +-+------+-------+--------+---------+--------+--------+--------+--------+
   |V|  P   |     Destination-Type     | Type-Extension  |   Packet-Type   |
   +-+-+---++--------------------------+-+------+--------+--------+--------+
   | E | PL| Data-Length>=0 (8B-words) |h|  RZ  |0     Source-Address      |
   +---+---+--------+--------+---------+-+------+--------+--------+--------+
     4   3              25              1   7                24

   These fields are described below:

                                          Size (bytes.bits)
      a. Version (V)                            0.2
      b. Priority (P)                           0.6
      c. Destination-Type (DT)                  3.0
      d. Packet Type Extension (TE)             2.0
      e. Packet Type (PT)                       2.0
      f. Endianness (E)                         0.4
      g. Padding Length (PL)                    0.3
      h. Data Length (DL)                       3.1
      i. Options flag (h)                       0.1
      j. Reserved (RZ)                          0.7
      k. Source Address (SA)                    3.0

6a. Version (V) 2 bits

   This field is static. Its 2 bits are 0b00 for the working version
   of the protocol. These bits should have other values for
   co-existing experimental versions.

6b. Priority (P) unsigned integer, 6 bits

   It is anticipated that some SANs, especially those working in real
   time, will want to implement priorities. This field supports such
   usage.

   All ones is the highest priority, and all zeroes the lowest.
   Ideally, packets with higher priority should gain access to
   contested resources before packets with lower priority.
   Implementations may ignore the Priority field.

6c. Destination-Type (DT) 24 bits

   The purpose of this field is to specify the header type, as well
   as the destination of the packet, when applicable.

   This field may specify:

      * A physical PktWay address (of 23 bits);
      * An L2-Routing-Header (L2RH) of a variable length;
      * A logical address (of 20 bits); or
      * A symbol (of 20 bits).

   More types are expected to be needed in the future.

   A variant of Huffman coding is used to accommodate all these
   methods in the Destination-Type field. This is done by assigning
   the MSbit 0 to physical addresses, the 2 MSbits 0b10 to L2RHs,
   the 3 MSbits 0b110 to future needs, the 4 MSbits 0b1110 to logical
   addresses, and the 4 MSbits 0b1111 to symbols.
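   The prefix assignment above lends itself to the single 16-way
   switch mentioned below. The following C sketch decodes the first
   four bytes of a header (V, P, and the 24-bit Destination-Type,
   which per Section 1b is all a cut-through router needs) and
   dispatches on the 4 MSbits of the DT. The enum, struct, and
   function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    PW_DT_PHYSICAL,  /* 0xxx: 23-bit physical address */
    PW_DT_L2RH,      /* 10xx: L2 Routing Header */
    PW_DT_RESERVED,  /* 110x: reserved for future needs */
    PW_DT_LOGICAL,   /* 1110: 20-bit logical address */
    PW_DT_SYMBOL     /* 1111: 20-bit symbol */
} pw_dt_kind;

typedef struct {
    uint8_t  version;   /* V: 2 MSbits of the first byte */
    uint8_t  priority;  /* P: 6 LSbits of the first byte */
    uint32_t dt;        /* Destination-Type: next 3 bytes, big-endian */
} pw_lead;

/* Decode the first four bytes of a header (big-endian). */
pw_lead pw_decode_lead(const uint8_t b[4])
{
    pw_lead r;
    r.version  = (uint8_t)(b[0] >> 6);
    r.priority = (uint8_t)(b[0] & 0x3F);
    r.dt = ((uint32_t)b[1] << 16) | ((uint32_t)b[2] << 8) | b[3];
    return r;
}

/* The 16-way switch on the 4 MSbits of the 24-bit Destination-Type. */
pw_dt_kind pw_classify_dt(uint32_t dt)
{
    switch ((dt >> 20) & 0xF) {
    case 0x0: case 0x1: case 0x2: case 0x3:
    case 0x4: case 0x5: case 0x6: case 0x7:
        return PW_DT_PHYSICAL;
    case 0x8: case 0x9: case 0xA: case 0xB:
        return PW_DT_L2RH;
    case 0xC: case 0xD:
        return PW_DT_RESERVED;
    case 0xE:
        return PW_DT_LOGICAL;
    default:                          /* 0xF */
        return PW_DT_SYMBOL;
    }
}

/* The 23-bit physical address carried by a physical Destination-Type. */
uint32_t pw_physical_address(uint32_t dt)
{
    assert(pw_classify_dt(dt) == PW_DT_PHYSICAL);
    return dt & 0x7FFFFF;
}
```

   For example, the bytes 0x3F 0x7F 0xFF 0xFE decode to V=0, P=63
   (the highest priority), and the physical "Hey-You!" address
   0x7FFFFE.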
   This assignment is summarized in the following table:

      MSbits | Method
      -------+----------
       0xxx  | Physical
       10xx  | L2RH
       110x  | Reserved
       1110  | Logical
       1111  | Symbol

   A single C-style 16-way switch on these 4 MSbits can quickly
   dispatch the protocol processor to the handler required for any of
   the methods used to specify the destination.

   The physical addresses are unique within each instance of PktWay.
   Nodes should have addresses assigned to them. The method of
   assigning unique addresses within each PktWay is not specified
   here.

   Examples of potentially addressable PktWay nodes include groups of
   cooperating processes, an entire MPP, or each of an MPP's many
   processors or processes.

   The prefix 0b10xx was chosen for L2RHs to be consistent with the
   0b10 indication of L2RHs described earlier in this document.

   "Logical Addresses" (e.g., for broadcast and for multicast groups)
   are also in this address space. The Destination-Type is a "Logical
   Address" if its 4 MSbits are set to 0b1110.

   A few physical addresses are reserved:

      0x000000  Undefined address (illegal where an address is
                expected, but allowed in the SA field)

      0x7FFFFE  ("Hey-You!") This address could be used at power up
                to address nodes or routers over point-to-point links.
                ("If you receive it, it's for you.")

      0x7FFFFF  (Broadcast) This address is reserved for broadcast
                operations, which may be added in later versions.
                ("If you receive it, it's for you.")

6d. Type Extension (TE) 2 bytes

   An extension of the following PT field.

   Logically, the TE should come after the PT. However, the PT ends
   on an 8B-word boundary, which makes it easier to process than the
   TE, which is 2B-aligned but not 8B-aligned. Since the PT is used
   more frequently than the TE, it was assigned to the better-aligned
   position.

6e. Packet Type (PT) 2 bytes

   The PT field provides the information needed for efficient
   de-multiplexing of multiple protocol layers. Whereas traditional
   protocol layering requires several stages of sequential
   de-multiplexing, PktWay provides enough information to support a
   single combined de-multiplexing operation (such as in support of
   zero-copy TCP). Thus, the PT field may indicate, for example, that
   the data blocks contain IP, SNMP, ATM, Ethernet, or other layered
   protocols.

   PT values to support popular parallel programming APIs such as MPI
   have been defined. The PktWay Enumeration document defines several
   values for this PT field.

   The PT field value of "RRP" indicates that the message contains
   commands used in the PktWay Router-to-Router Protocol (RRP).

   Some PTs will also use the 2-byte Type Extension (TE) field, which
   precedes the PT, for passing PT-specific parameters, such as
   implementation-specific de-multiplexing information.

   RRP messages (as described in the PktWay RRP document) use the TE
   field to distinguish among the various RRP messages.

   Special Packet Types:

      RRP - PktWay's Router/Router protocol (see the RRP document).

      ERR - Error reporting packet, usually sent to the Source Address
            (SA, see below) in response to a PktWay message that could
            not be properly handled, such as "Destination Unknown."
            The TE indicates the nature of the error (e.g., UNK) as
            defined in the PktWay Enumeration document.

6f. Endianness (E), 4 bits

   If the SAN interface of the receiving node detects an Endianness
   that is different from its own, and if the entire Data Block (DB)
   consists of N-byte fields, then it may activate byte-swapping
   hardware for N-byte fields, saving much work for the receiving
   node.

   The first bit (MSbit) of E, 'e', indicates whether the DB is in
   Big-Endian order (e=0) or in Little-Endian order (e=1).
The next 743 3 bits could control hardware byte swapping, if any, which assumes 744 that all the data consists of words of the same length. 746 The meaning associated with the values of the 3 LSbits of this 747 field are defined in the PktWay enumeration document. 749 6g. Pad Length (PL) unsigned integer, 3 bits 751 The number of padding bytes that were added at the end of the DB 752 (i.e., from the end of the data to the end of the DB). PL can be 753 between 0 and 7. 755 6h. Data Length (DL) unsigned integer, 25 bits 757 Length, in 8B-words, of the data block, not including the L2RHs, 758 EEP-header, OH, OT, and TAIL, including any optional padding. 759 Hence, the net length of the Data Block is 8*DL-PL bytes. The 760 minimum is zero, and the maximum length is (2^25-1)*8 bytes = ~2^28 761 = 256 MBytes. 763 6i. Optional Header-Field Flag (h) 1 bit 765 This bit is set to 1 if there are one (or more) optional header 766 (OH) fields following the standard 16-byte EEP-header. 768 6j. Reserved (RZ) 7 bits 770 This field is reserved for future use. Applications should neither 771 use it, nor count on others not to use it. It should be always set 772 to zero (0b0000000). 774 6k. Source Address (SA) 24 bit 776 This field contains the physical address of the packet's original 777 source in the same format as the DT. However, unlike the DT, the 778 SA must be a physical address. 780 Filling in this field is optional. A value of zero means that the 781 SA is not specified. 783 Routers may use this field to identify the sender to which error 784 messages may be returned. 786 7. Optional Header Fields (OH) [3] 788 A PktWay-message has Optional Header fields (OH) following the 789 EEP-header, if the Option-Flag (h) is set to 1 in the EEP-header. 
   Each OH field has the format:

+--------+--------+--------+--------+--------+--------+--------+--------+
|tttttttt|LLLLLLLL|  data  |........|........|........|........|........|
+--------+--------+--------+--------+--------+--------+--------+--------+

   The first byte indicates the optional header field type (OH-TYPE).

   The first bit, T, of the first byte indicates how this OH-TYPE is
   processed:

      T=0: Optional (the field may be dropped if this OH-TYPE is
           unknown)
      T=1: Mandatory (the message should not be processed if this
           OH-TYPE is unknown)

   The second bit, C, of the first byte indicates whether more OH
   fields follow (i.e., whether this is the last OH field of this
   message):

      C=0: More Optional Header fields follow
      C=1: End of the Optional Header fields group (i.e., this is the
           last OH)

   The other 6 bits of this byte, tttttt, define application-specific
   OH-TYPEs.

   The second byte is the byte count (L) of the data of this field.
   The data starts in the next byte and is followed by as many
   padding bytes as needed to fill whole 8B-words.

   The length (L) does not include itself; hence, it is between 0 and
   255. The field occupies 2 bytes (OH-TYPE and L) plus L data bytes,
   rounded up to whole 8B-words, so the total number of 8B-words in
   the OH field is [(L+9)/8], where the square brackets denote the
   integer part (rounded down) of the quantity within. Therefore, the
   number of padding bytes is PL=8*[(L+9)/8]-2-L.

   Example: An Optional Header field (OH) with a mandatory OH-TYPE and
   4 data bytes:

            L=4       #1       #2       #3       #4     padding  padding
+--------+--------+--------+--------+--------+--------+--------+--------+
|1xtttttt|00000100| data01 | data02 | data03 | data04 |  xxxx  |  xxxx  |
+--------+--------+--------+--------+--------+--------+--------+--------+
                  |<------------- value ------------->|

8. Optional Data Block (DB) [4]

   The DB is free for applications to use in any way. Routers must not
   modify this field.
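   Since routers must leave the DB untouched, a receiver mainly needs
   to find where it starts. The C sketch below does that under stated
   assumptions: it skips the 16-byte EEP header and, when h=1, walks
   the OH group of section 7, sizing each field as [(L+9)/8] 8B-words
   and honoring the T and C bits. It assumes the DB immediately
   follows the last OH field; the function names and the known_mask
   parameter (bit t set if the 6-bit OH-TYPE t is understood) are
   hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* 8B-words occupied by one OH field with L data bytes: [(L+9)/8]. */
static size_t oh_words(uint8_t len)
{
    return ((size_t)len + 9) / 8;
}

/* Padding bytes after the OH data: 8*[(L+9)/8]-2-L, always 0..7. */
static size_t oh_padding(uint8_t len)
{
    return 8 * oh_words(len) - 2 - len;
}

/* Size in bytes of the OH group in buf (n bytes, starting right after
 * the EEP header), or 0 on error: truncated input, or an unknown
 * mandatory (T=1) OH-TYPE. Unknown optional (T=0) fields are skipped.
 * known_mask is hypothetical: bit t set if 6-bit OH-TYPE t is known. */
static size_t oh_group_bytes(const uint8_t *buf, size_t n,
                             uint64_t known_mask)
{
    size_t off = 0;
    for (;;) {
        if (n - off < 2)
            return 0;                         /* truncated */
        uint8_t type = buf[off];
        size_t  size = 8 * oh_words(buf[off + 1]);
        if (size > n - off)
            return 0;                         /* truncated */
        int mandatory = (type & 0x80) != 0;  /* T bit */
        int last      = (type & 0x40) != 0;  /* C bit */
        int known = (int)((known_mask >> (type & 0x3F)) & 1u);
        if (mandatory && !known)
            return 0;
        off += size;
        if (last)
            return off;
    }
}

/* Byte offset of the DB from the start of the EEP header, or 0 on
 * error. Assumption: the DB starts right after the OH group. */
static size_t db_offset(const uint8_t *after_hdr, size_t n, int h,
                        uint64_t known_mask)
{
    if (!h)
        return 16;
    size_t oh = oh_group_bytes(after_hdr, n, known_mask);
    return oh ? 16 + oh : 0;
}
```

   For the example in section 7 (a mandatory OH-TYPE with L=4), marked
   as the last OH field and with its OH-TYPE set in known_mask,
   oh_group_bytes returns 8, so the DB would start 24 bytes after the
   start of the EEP header.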
   The DB consists of DL 8B-words, including optional padding of PL
   bytes at the end. Hence, the number of data bytes is 8*DL-PL. Both
   DL and PL are specified in the EEP header.

   The maximum length of the DB is 8*(2^25-1) bytes, about 256 MBytes.

9. Optional Trailer Fields (OT) [5]

   A PktWay message has Optional Trailer (OT) fields if so indicated
   in an Optional Header field; e.g., an OH field may indicate that a
   CRC64 is in the OT.

   An OT may contain just the data for an OH defined above (following
   the EEP header), or be a stand-alone, self-defined field in the
   same format as an OH.

   The OT fields are in the order defined by the OHs. For example, if
   an OH field indicating that a CRC32 is in the OT is followed by
   another OH field indicating that a CRC64 is in the OT, then the OT
   with the CRC32 should be followed by the OT with the CRC64.
   Self-defined OT fields must follow the OTs defined by the OHs.

10. EEP Trailer (TAIL) [6]

   The TAIL consists only of the Error Indication (EI) field, which is
   a single 8B-word.

   Routers may start forwarding packets toward their destinations
   before detecting transmission errors (e.g., in wormhole routing).
   The EI field provides such routers with a means to append an error
   indication to the end of a packet.

   An all-zero EI value means that no error was indicated. Any
   non-zero EI value indicates one or more errors.

   The packet source will usually initialize the EI field to all
   zeros. However, as an alternative example, a memory board may
   create a packet with a non-zero EI field (EI=1) to indicate that
   the memory board detected a parity error.

   Each router performs an arithmetic left shift of the EI field by
   one bit, unless its MSbit is already 1. Routers that detect
   transmission errors also set the LSbit (after the shift) to 1.

   This makes it possible to identify which routers have indicated
   errors (if the route is known).

11. Appendix-A: A Recommendation for PktWay Address Assignment

   This section of the EEP document is a recommendation only, and not
   part of the PktWay standard.

   Unlike IP addresses, physical PktWay addresses are not globally
   unique, but must be locally unique within each PktWay
   configuration. Hence, when independently developed SANs are
   interconnected to form a PktWay, conflicting physical addresses may
   occur.

   It is recommended not to attempt to assure local uniqueness of
   physical addresses by subdividing the global address space (i.e.,
   by attempting to achieve global uniqueness).

   Instead, it is recommended that every SAN have local PktWay
   addresses, between 1 and the number of its local nodes, and also a
   global "bias" to be added to all the addresses in that SAN. Hence,
   by proper setting of the biases of interconnected SANs, local
   uniqueness of PktWay addresses is achieved.

   The coordination of these biases is left (at least for now) to
   manual (static) out-of-band coordination.

   The use of such biases simplifies the mapping of physical addresses
   to their SANs.

12. Appendix-B: Glossary

   Address:            A unique designation of a node (actually, of an
                       interface to that node) or a SAN.
   Buddy-HR:           HRs are "buddies" if they are on the same SAN.
   Cut-Thru:           See Wormhole-routing.
   Destination:        The node for which a packet is intended.
   Dynamic-Routing:    Routing according to dynamic information (i.e.,
                       acquired at run time, rather than pre-set).
   Endianness:         The property of being Big-Endian or Little-Endian
                       (transmission order, etc.).
   Ethertype:          A 16-bit value designating the type of Level-3
                       packets carried by a Level-2 communication
                       system.
   HR:                 Half-Router, the part of a router that handles
                       one network only.
   L2-Forwarding:      Forwarding based on Level-2 (i.e., the data-link
                       layer of the ISORM) information, e.g., the native
                       technique of each SAN or LAN.
                       Also called "source routing."
   L3-Forwarding:      Forwarding based on end-to-end (Level-3, i.e.,
                       the network layer of the ISORM) addresses. Also
                       called "destination routing."
   Map:                The topology of a network.
   Mapper:             A node on a SAN/LAN that has the map and an RT
                       for that network. The mapper is expected to
                       update the map and the RT dynamically.
   Multi-homed Node:   A node with more than one network interface,
                       where each interface has a different address.
   Node:               Whatever can send and receive packets (e.g., a
                       computer, an MPP, a software process, etc.).
   Node structure:     A C-struct (or equivalent) containing values for
                       some attributes of a node.
   Planned Transfer:   A transfer of information that occurs after an
                       initial phase in which the sender decides which
                       Level-2 route to use for that transfer.
   RCVF:               The "Received From" set, which includes all the
                       physical addresses through which an RT was
                       disseminated, starting with the address of the
                       mapper that created that RT.
   Re-direct-message:  A message that tells nodes which HR should be
                       used in order to get to a certain remote address.
   Router:             The inter-SAN communication device.
   Security Context:   A relationship between 2 (or more) nodes that
                       defines how the nodes utilize security services
                       to communicate securely.
   Source:             The node that created a packet.
   Source-Route:       A Level-2 route that is chosen for a packet by
                       its source.
   Symbol:             Data preceding the EEP header of a PktWay
                       message, interleaved with the L2RHs.
   Twin-HR:            Two HRs are twins if they are both parts of the
                       same inter-SAN router.
   Wormhole-routing:   (aka cut-thru routing) Forwarding packets out of
                       switches as soon as possible, without storing
                       the entire packet in the switch (unlike
                       store-and-forward).
   Zero-copy TCP:      A TCP system that copies data directly between
                       the user area and the network device, bypassing
                       OS copies.

13. Appendix-C: Acronyms and Abbreviations

   0bNNNN  The binary number NNNN (e.g., 0b0100 is 4 decimal)
   0xNNNN  The hexadecimal number NNNN (e.g., 0x0100 is 256 decimal)
   8B      8-byte (64-bit) entity
   ADDR    The Address-record of RRP
   API     Application Programming Interface
   AT      Address Type
   ATM     Asynchronous Transfer Mode
   B       Byte (e.g., 4B)
   b       bit (e.g., 32b)
   BC      Byte Count (of parameters)
   BER     Bit Error Rate
   CAPA    The CAPAbility-record of RRP
   CC      Capability Code
   CSR     Common Source-Route
   DA      Destination Address
   DB      Data Block
   DL      Data Length (in 8B-words)
   DSP     Digital Signal Processor
   DT      Destination-Type
   e       The MSbit of E
   E       The Endianness field (in the EEP header)
   EEP     End/End Protocol
   EI      Error Indication
   GP      General Purpose
   GVL2    An RRP message requesting an L2 route to a given destination
   GVRT    An RRP message asking an HR to give its routing tables
   h       Optional header fields flag
   HR      Half Router
   HRTO    An RRP message asking which HR to use for a given destination
   ID      Identification
   IGMP    Internet Group Management Protocol
   INFO    An RRP message providing information about nodes
   IP      The Internet Protocol
   ISORM   The ISO Reference Model
   L       Length field (exclusive of itself)
   L2      Level-2 of the ISORM (Link)
   L2RH    Level-2 Routing Header
   L2SR    Source Route
   L3      Level-3 of the ISORM (Network)
   LA      Logical Address
   LADR    The Logical-addresses-record of RRP
   LAN     Local Area Network
   LRT     Local Routing Table
   LSbit   Least Significant bit
   LSbyte  Least Significant byte
   MAC     Message Authentication Code / Media Access Control
   MPI     Message Passing Interface
   MPP     Massively Parallel Processing system
   MSbit   Most Significant bit
   MSbyte  Most Significant byte
   MSU     Mississippi State University
   MTU     Maximum Transmission Unit
   MTUR    The MTU-record of RRP
   M/C     Multicast
   NAME    The name-record of RRP
   NFS     Network File System
   OH      Optional Header field
   OH-TYPE The Type of an Optional Header field
   OT      Optional Trailer field
   P       The Priority field
   PAD     Padding After Data
   PBD     Padding Before Data
   PCI     The Peripheral Component Interconnect "standard"
   PH      PacketWay Header
   PL      Padding Length (always in bytes)
   PPP     The Point-to-Point Protocol
   PROM    Programmable ROM (Read-Only Memory)
   PT      Packet Type (2B)
   PVM     Parallel Virtual Machine
   PW      The Myrinet Packet Type assigned to PktWay (PW=0x0300)
   Q       Quality (of a path)
   RCVF    Received-From list, or the Received-From record of RRP
   RDRC    A re-direct message of RRP
   RH      Routing Header
   RID     Record ID
   RL      Record Length (in 8B-words)
   RRP     Router/Router Protocol
   RT-hd   RT (Routing Table) header
   RT      Routing Table
   RTBL    An RRP message providing a Routing Table
   RTHD    The Routing-Table-Header record of RRP
   RTyp    RRP's Record Type
   RZ      The Reserved field (in the EEP header)
   SA      Source Address
   SAN     System Area Network
   SAN-ID  The 24-bit PktWay-address of a SAN
   SAR     Segmentation and Reassembly
   SN      Serial Number
   SNID    SAN-ID
   SNMP    Simple Network Management Protocol
   SR      Source Route (always at Level-2)
   SRQR    The Source-Route-and-Q-record of RRP
   ST      Symbol Type
   TAIL    PacketWay EEP Trailer
   TE      Type Extension (2B)
   TELL    An RRP message requesting information about partially
           specified nodes
   UNK     Unknown
   V       Version
   WRU?    An RRP message asking its recipient to identify itself
   XRT     External Routing Table
   xxxx    A padding byte

14. Appendix-D: PktWay at a Glance (aka "The Cheat-Sheet")

 2   6     type       24                    16                16
+-+------+-------+--------+---------+--------+--------+--------+--------+
|V|  P   |     Destination-Type     |  Type-Extension |   Packet-Type   |
+-+-+---++--------------------------+-+------+--------+-----------------+
| E | PL|  Data-Length (8B-words)   |h|  RZ  |0     Source-Address      |
+---+---+--------+--------+---------+-+------+--------+--------+--------+
  4   3               25             1   7    1           23

type = 0xxx  Physical Address
       10xx  L2RH
       110x  Reserved
       1110  Logical Address
       1111  Symbols

L2RH:
 2   6    2   6       8        8        8        8        8        8
+--------+--------+--------+--------+--------+--------+--------+--------+
|V|  P   |10LLLLLL|  SR01  |  SR02  |........|........|........|........|
+--------+--------+--------+--------+--------+--------+--------+--------+
            Length

Symbol:
 2   6    4   4       8        8        8        8        8        8
+--------+--------+--------+--------+--------+--------+--------+--------+
|V|  P   |1111ssss|ssssssss|ssssssss| Length |  data  |........|........|
+--------+--------+--------+--------+--------+--------+--------+--------+
              <---- Symbol Type --->

Optional Header:
 2   6       8        8        8        8        8        8        8
+--------+--------+--------+--------+--------+--------+--------+--------+
|TCtttttt|LLLLLLLL|  data  |........|........|........|........|........|
+--------+--------+--------+--------+--------+--------+--------+--------+
T: 0=optional, 1=mandatory; C: 0=more OH fields follow, 1=last OH field

RRP Record:
    8        8        8        8        8        8        8        8
+--------+--------+--------+--------+--------+--------+--------+--------+
|  RTyp  |   PL   |       RL        |........|........|........|........|
+--------+--------+--------+--------+--------+--------+--------+--------+
RRP messages: GVL2, L2SR, RDRC, TELL, INFO, HRTO, WRU, GVRT, RTBL.
RTyp: ADDR, NAME, CAPA, LADR, SRQR, MTUR, RCVF, RTHD.

15. Security Considerations

   This RFC raises no security issues.

16. Editor's Address

   Danny Cohen
   Myricom, Inc.
   325 N. Santa Anita Ave
   Arcadia, CA 91006

   Phone: 626-821-5555
   Fax:   626-821-5316
   Email: Cohen@myri.com