Network File System Version 4                                 S. Faibish
Internet-Draft                                                  D. Black
Intended status: Informational                                  Dell EMC
Expires: January 6, 2021                                      C. Hellwig
                                                            July 6, 2020

           Using the Parallel NFS (pNFS) SCSI/NVMe Layout
               draft-faibish-nfsv4-scsi-nvme-layout-00

Abstract

   This document explains how to use the Parallel Network File System
   (pNFS) SCSI Layout Type with NVMe transports that use the NVMe over
   Fabrics protocols.  This draft builds on the earlier SCSI-over-NVMe
   draft by C. Hellwig and extends it to support all of the transport
   protocols defined for NVMe over Fabrics, in addition to the SCSI
   transport protocols used by the pNFS SCSI Layout.  The proposed
   mapping supports several transports and fabrics, including the RDMA
   transports.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of Internet-Draft Shadow Directories can be accessed at
   https://www.ietf.org/standards/ids/internet-draft-mirror-sites/.

   This Internet-Draft will expire on January 6, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  2
     1.1. Conventions Used in This Document  . . . . . . . . . . . .  2
     1.2. General Definitions  . . . . . . . . . . . . . . . . . . .  2
   2. SCSI Layout mapping to NVMe  . . . . . . . . . . . . . . . . .  3
     2.1. Volume Identification  . . . . . . . . . . . . . . . . . .  7
     2.2. Client Fencing . . . . . . . . . . . . . . . . . . . . . .  7
       2.2.1. Reservation Key Generation . . . . . . . . . . . . . .  8
       2.2.2. MDS Registration and Reservation . . . . . . . . . . .  8
       2.2.3. Client Registration  . . . . . . . . . . . . . . . . .  8
       2.2.4. Fencing Action . . . . . . . . . . . . . . . . . . . .  8
       2.2.5. Client Recovery after a Fence Action . . . . . . . . .  9
     2.3. Volatile write caches  . . . . . . . . . . . . . . . . . . 10
   3. Security Considerations  . . . . . . . . . . . . . . . . . . . 10
   4. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 10
   5. Normative References . . . . . . . . . . . . . . . . . . . . . 11
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 11

1. Introduction

   The pNFS Small Computer System Interface (SCSI) layout [RFC8154] is
   a layout type that allows NFS clients to perform I/O directly to
   block storage devices while bypassing the metadata server (MDS).
   It is specified using concepts from the SCSI protocol family for
   the data path to the storage devices.  This document explains how
   to access PCI Express, RDMA, or Fibre Channel devices that use the
   NVM Express protocol [NVME] through the SCSI layout.  This document
   does not amend the pNFS SCSI layout document in any way; instead,
   it explains how to map the SCSI constructs used in the pNFS SCSI
   layout document to NVMe concepts, using the NVMe SCSI translation
   reference.

1.1. Conventions Used in This Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

1.2. General Definitions

   The following definitions are provided to give the reader an
   appropriate context.

   Client  The "client" is the entity that accesses the NFS server's
      resources.  The client may be an application that contains the
      logic to access the NFS server directly.  The client may also be
      the traditional operating system client that provides remote
      file system services for a set of applications.

   Server/Controller  The "server" is the entity responsible for
      coordinating client access to a set of file systems and is
      identified by a server owner.

2. SCSI Layout mapping to NVMe

   The SCSI layout definition [RFC8154] only references a few SCSI-
   specific concepts directly.

   The NVM Express [NVME] Base Specification revision 1.4 and prior
   revisions define a register-level interface for host software to
   communicate with a non-volatile memory subsystem over PCI Express
   (NVMe over PCIe).  The NVMe over Fabrics specification [NVMEoF]
   defines extensions to NVMe that enable operation over other
   interconnects (NVMe over Fabrics).  The NVM Express Base
   Specification revision 1.4 is referred to as the NVMe Base
   specification.
   The goal of this draft is to enable an implementer who is familiar
   with the pNFS SCSI layout [RFC8154] and the NVMe standards (both
   NVMe-oF 1.1 and NVMe 1.4) to implement the pNFS SCSI layout over
   NVMe-oF.  The mapping of extensions defined in this document refers
   to a specific NVMe Transport defined in an NVMe Transport binding
   specification.  This document refers to the NVMe Transport binding
   specifications for FC, RDMA, and TCP [RFC7525].  The NVMe Transport
   binding specification for Fibre Channel is defined in INCITS 540,
   Fibre Channel - Non-Volatile Memory Express [FC-NVMe].

   NVMe over Fabrics has the following differences from the NVMe Base
   specification used with SCSI:

   -  There is a one-to-one mapping between I/O Submission Queues and
      I/O Completion Queues.  NVMe over Fabrics does not support
      multiple I/O Submission Queues being mapped to a single I/O
      Completion Queue;

   -  NVMe over Fabrics does not define an interrupt mechanism that
      allows a controller to generate a host interrupt.  It is the
      responsibility of the host fabric interface (e.g., Host Bus
      Adapter) to generate host interrupts;

   -  NVMe over Fabrics does not use the Create I/O Completion Queue,
      Create I/O Submission Queue, Delete I/O Completion Queue, and
      Delete I/O Submission Queue commands.  NVMe over Fabrics does
      not use the Admin Submission Queue Base Address (ASQ), Admin
      Completion Queue Base Address (ACQ), and Admin Queue Attributes
      (AQA) properties (i.e., registers in PCI Express).  Queues are
      created using the Connect command;

   -  NVMe over Fabrics uses the Disconnect command to delete an I/O
      Submission Queue and the corresponding I/O Completion Queue;

   -  Metadata, if supported, shall be transferred as a contiguous
      part of the logical block.  NVMe over Fabrics does not support
      transferring metadata from a separate buffer;

   -  NVMe over Fabrics does not support PRPs but requires use of SGLs
      for Admin, I/O, and Fabrics commands.  This differs from NVMe
      over PCIe, where SGLs are not supported for Admin commands and
      are optional for I/O commands;

   -  NVMe over Fabrics does not support Completion Queue flow
      control.  This requires that the host ensure there are available
      Completion Queue slots before submitting new commands; and

   -  NVMe over Fabrics allows Submission Queue flow control to be
      disabled if the host and controller agree to disable it.  If
      Submission Queue flow control is disabled, the host is required
      to ensure that there are available Submission Queue slots before
      submitting new commands.

   NVMe over Fabrics requires the underlying NVMe Transport to provide
   reliable NVMe command and data delivery.  An NVMe Transport is an
   abstract protocol layer independent of any physical interconnect
   properties.  An NVMe Transport may expose a memory model, a message
   model, or a combination of the two.  A memory model is one in which
   commands, responses, and data are transferred between fabric nodes
   by performing explicit memory read and write operations, while a
   message model is one in which only messages containing command
   capsules, response capsules, and data are sent between fabric
   nodes.

   The only memory model NVMe Transport supported by NVMe [NVME] is
   PCI Express, as defined in the NVMe Base specification.
   While differences exist between NVMe over Fabrics and NVMe over
   PCIe implementations, both implement the same architecture and
   command sets.  However, the NVMe SCSI translation reference used in
   this document applies only to NVMe over Fabrics, not to the memory
   model.  NVMe over Fabrics utilizes the protocol layering shown in
   Figure 1.  The native fabric communication services and the Fabric
   Protocol and Physical Fabric layers in Figure 1 are outside the
   scope of this specification.

               +-------------------+
               | pNFS host SCSI    |
               | layout over NVMe  |
               +---------+---------+
                         |
                         v
               +---------+---------+
               | NVMe over Fabrics |
               +---------+---------+
                         |
                         v
               +---------+---------+
               | Transport Binding |
               +---------+---------+
                         |
                         v
               +---------+---------+
               | NVMe Transport svc|
               +---------+---------+
                         |
                         v
               +---------+---------+
               |  NVMe Transport   |
               +---------+---------+
                         |
                         v
               +---------+---------+
               |  Fabric Protocol  |
               +---------+---------+
                         |
                         v
               +---------+---------+
               |  Physical Fabric  |
               +---------+---------+
                         |
                         v
            +------------------------+
            |    pNFS SCSI layout    |
            | server/NVMe controller |
            +------------------------+

        Figure 1: pNFS SCSI over NVMe over Fabrics Layering

   An NVM subsystem port may support multiple NVMe Transports if more
   than one NVMe Transport binding specification exists for the
   underlying fabric (e.g., an NVM subsystem port identified by a Port
   ID may support both iWARP and RoCE).  This draft also defines an
   NVMe binding implementation that uses the RDMA Transport type.  The
   RDMA Transport is RDMA Provider agnostic.  The diagram in Figure 2
   illustrates the layering of the RDMA Transport and common RDMA
   providers (iWARP, InfiniBand, and RoCE) within the host and the NVM
   subsystem.

      +------------------------------------------+
      |                 NVMe Host                |
      +------------------------------------------+
      |              RDMA Transport              |
      +--------------+--------------+------------+
      |    iWARP     |  InfiniBand  |    RoCE    |
      +--------------+------++------+------------+
                            ||
                            ||  RDMA Fabric
                            vv
      +--------------+------++------+------------+
      |    iWARP     |  InfiniBand  |    RoCE    |
      +--------------+--------------+------------+
      |              RDMA Transport              |
      +------------------------------------------+
      |              NVM Subsystem               |
      +------------------------------------------+

            Figure 2: RDMA Transport Protocol Layers

   NVMe over Fabrics allows multiple hosts to connect to different
   controllers in the NVM subsystem through the same port.  All other
   aspects of NVMe over Fabrics multi-path I/O and namespace sharing
   are equivalent to those defined in the NVMe Base specification.

   An association is established between a host and a controller when
   the host connects to a controller's Admin Queue using the Fabrics
   Connect command.  Within the Connect command, the host specifies
   the Host NQN, NVM Subsystem NQN, and Host Identifier, and may
   request a specific Controller ID or may request a connection to any
   available controller.  In this document, the host is the pNFS
   client and the controller is the NFSv4 server.  The pNFS clients
   connect to the server using different network protocols and
   different transports, excluding direct PCIe connection.  While an
   association exists between a host and a controller, only that host
   may establish connections with I/O Queues of that controller.
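   The following non-normative C sketch illustrates the data portion
   of the Fabrics Connect command described above.  The field layout
   and the convention that a Controller ID of FFFFh requests any
   available controller (dynamic controller model) reflect the
   authors' reading of [NVMEoF] and should be verified against that
   specification before being relied upon.

   #include <stdint.h>

   #define NVMEOF_NQN_LEN        256
   #define NVMEOF_HOSTID_LEN      16

   /* Assumed convention: CNTLID = 0xFFFF requests a connection to
    * any available controller (dynamic controller model). */
   #define NVMEOF_CNTLID_DYNAMIC  0xFFFF

   struct nvmeof_connect_data {
       uint8_t  hostid[NVMEOF_HOSTID_LEN]; /* Host Identifier         */
       uint16_t cntlid;                    /* requested Controller ID */
       uint8_t  rsvd18[238];               /* bytes 255:18 reserved   */
       char     subnqn[NVMEOF_NQN_LEN];    /* NVM Subsystem NQN       */
       char     hostnqn[NVMEOF_NQN_LEN];   /* Host NQN                */
       uint8_t  rsvd768[256];              /* bytes 1023:768 reserved */
   };

   _Static_assert(sizeof(struct nvmeof_connect_data) == 1024,
                  "Connect command data is 1024 bytes");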
   NVMe over Fabrics supports both fabric secure channel and NVMe
   in-band authentication.  An NVM subsystem may require a host to use
   fabric secure channel, NVMe in-band authentication, or both.  The
   Discovery Service indicates whether fabric secure channel shall be
   used for an NVM subsystem.  The Connect response indicates whether
   NVMe in-band authentication shall be used with that controller.
   For SCSI over NVMe over Fabrics, only the in-band authentication
   model is used, as the fabric secure channel applies to the memory
   model transport, which is not supported by the SCSI layout
   protocol.

2.1. Volume Identification

   The pNFS SCSI layout uses the Device Identification VPD page (page
   code 0x83) from [SPC4] to identify the devices used by a layout.
   There are several ways to build SCSI Device Identification
   descriptors from NVMe Identify data, including the Identify
   Controller attributes specific to NVMe over Fabrics specified in
   Section 4.1 of [NVMEoF].  This document uses a subset of this
   information to identify LUs backing pNFS SCSI layouts.

   To be used as storage devices for the pNFS SCSI layout, NVMe
   devices MUST support the EUI-64 [RFC8154] value in the Identify
   Namespace data, as the methods based on the Serial Number for
   legacy devices might not be suitable for unique addressing needs
   and thus MUST NOT be used.  UUID identification can be added by
   using a large enough enum value to avoid conflict with whatever T10
   might do in a future version of the SCSI [SBC3] standard (the
   underlying SCSI field in SPC is 4 bits, so an enum value of 32 MUST
   be used in this draft).  For NVMe, these identifiers need to be
   obtained via the Namespace Identification Descriptors in NVMe 1.4
   (returned by the Identify command with the CNS field set to 03h).

2.2. Client Fencing

   The SCSI layout uses Persistent Reservations to provide client
   fencing.  For this, both the MDS and the Clients have to register a
   key with the storage device, and the MDS has to create a
   reservation on the storage device.  The pNFS SCSI protocol
   implements fencing using persistent reservations (PRs), similar to
   the fencing method used by existing shared disk file systems.  To
   allow fencing of individual systems, each system MUST use a unique
   persistent reservation key.  The following is a full mapping of the
   required PR IN and PR OUT SCSI commands to NVMe commands, which
   MUST be used when using NVMe devices as storage devices for the
   pNFS SCSI layout.

2.2.1. Reservation Key Generation

   Prior to establishing a reservation on a namespace, a host shall
   become a registrant of that namespace by registering a reservation
   key.  This reservation key may be used by the host as a means of
   identifying the registrant (host), authenticating the registrant,
   and preempting a failed or uncooperative registrant.  This document
   assigns the burden of generating unique keys to the MDS, which MUST
   generate a key for itself before exporting a volume and a key for
   each client that accesses SCSI layout volumes.

   One important difference between SCSI Persistent Reservations and
   NVMe Reservations is that NVMe reservation keys always apply to all
   controllers used by a host (as indicated by the NVMe Host
   Identifier).  This behavior is somewhat similar to setting the
   ALL_TG_PT bit when registering a SCSI Reservation key, but is
   actually guaranteed to work reliably.
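   As a non-normative illustration of the MDS's key-generation burden,
   the following C sketch shows one possible way to construct unique
   64-bit reservation keys by combining an MDS instance identifier
   with a per-client index.  The key layout is purely an
   implementation choice of this example; [RFC8154] only requires that
   each system use a unique key.

   #include <stdint.h>

   /* Hypothetical helper: client_index 0 could be reserved for the
    * key that the MDS registers for itself before exporting the
    * volume; clients get indices 1, 2, 3, ... */
   static inline uint64_t
   pnfs_make_pr_key(uint32_t mds_instance_id, uint32_t client_index)
   {
       return ((uint64_t)mds_instance_id << 32) | client_index;
   }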
2.2.2. MDS Registration and Reservation

   Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
   MDS needs to prepare the volume for fencing using NVMe
   Reservations.  Registering a reservation key with a namespace
   creates an association between a host and a namespace.  A host that
   is a registrant of a namespace may use any controller with which
   that host is associated (i.e., that has the same Host Identifier;
   refer to Section 5.21.1.26 of [NVME]) to access that namespace as a
   registrant.

2.2.3. Client Registration

2.2.3.1 SCSI client

   Before performing the first I/O to a device returned from a
   GETDEVICEINFO operation, the client will register the reservation
   key returned by the MDS with the storage device by issuing a
   "PERSISTENT RESERVE OUT" command with a service action of REGISTER
   and the "SERVICE ACTION RESERVATION KEY" set to the reservation
   key.

2.2.3.2 NVMe Client

   A client registers a reservation key by executing a Reservation
   Register command (refer to Section 6.11 of [NVME]) on the
   namespace, with the Reservation Register Action (RREGA) field
   cleared to 000b (i.e., Register Reservation Key) and a reservation
   key supplied in the New Reservation Key (NRKEY) field.  A client
   that is a registrant of a namespace may register the same
   reservation key value multiple times with the namespace on the same
   or different controllers.  There are no restrictions on the
   reservation key values used by hosts with different Host
   Identifiers.

2.2.4. Fencing Action

2.2.4.1 SCSI client

   In case of a non-responding client, the MDS fences the client by
   issuing a "PERSISTENT RESERVE OUT" command with the service action
   set to "PREEMPT" or "PREEMPT AND ABORT", the "RESERVATION KEY"
   field set to the server's reservation key, the service action
   "RESERVATION KEY" field set to the reservation key associated with
   the non-responding client, and the "TYPE" field set to 8h
   (Exclusive Access - Registrants Only).

2.2.4.2 NVMe Client

   A host that is a registrant may preempt a reservation and/or a
   registration by executing a Reservation Acquire command (refer to
   Section 6.10 of [NVME]), setting the Reservation Acquire Action
   (RACQA) field to 001b (Preempt), and supplying the current
   reservation key associated with the host in the Current Reservation
   Key (CRKEY) field.  The CRKEY value shall match the key used by the
   registrant to register with the namespace.  If the CRKEY value does
   not match, the command is aborted with a status of Reservation
   Conflict.

   If the Preempt Reservation Key (PRKEY) field value does not match
   that of the current reservation holder and is equal to 0h, then the
   command is aborted with a status of Invalid Field in Command.  A
   reservation preempted notification occurs on all controllers in the
   NVM subsystem that are associated with hosts whose registrations
   are removed as a result of the actions taken in this section,
   except those associated with the host that issued the Reservation
   Acquire command.  After the MDS preempts a client, all client I/O
   to the LU fails.  The client SHOULD at this point return any layout
   that refers to a device ID that points to the LU.
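   The following non-normative C sketch shows how the Reservation
   Register command from Section 2.2.3.2 and the preempting
   Reservation Acquire command from this section might be issued on a
   Linux host through the NVMe passthrough ioctl interface
   (<linux/nvme_ioctl.h>).  The opcodes, the 16-byte key buffer, and
   the reservation type value reflect the authors' reading of [NVME]
   and of the SCSI-to-NVMe mapping; a little-endian host is assumed,
   and the sketch is not part of the protocol mapping itself.

   #include <stdint.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <linux/nvme_ioctl.h>

   #define NVME_CMD_RESV_REGISTER  0x0d  /* [NVME] Section 6.11 */
   #define NVME_CMD_RESV_ACQUIRE   0x11  /* [NVME] Section 6.10 */

   /* NVMe reservation type assumed to correspond to the SCSI type 8h
    * (Exclusive Access - Registrants Only) used in Section 2.2.4.1. */
   #define NVME_RESV_EXCL_ACCESS_REG_ONLY  0x4

   static int nvme_resv_cmd(int fd, uint32_t nsid, uint8_t opcode,
                            uint32_t cdw10, uint64_t crkey,
                            uint64_t other_key)
   {
       /* Data buffer: CRKEY in bytes 7:0, NRKEY or PRKEY in 15:8. */
       uint64_t keys[2] = { crkey, other_key };
       struct nvme_passthru_cmd cmd;

       memset(&cmd, 0, sizeof(cmd));
       cmd.opcode   = opcode;
       cmd.nsid     = nsid;
       cmd.addr     = (uintptr_t)keys;
       cmd.data_len = sizeof(keys);
       cmd.cdw10    = cdw10;
       return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
   }

   /* Client side (2.2.3.2): register the key handed out by the MDS;
    * RREGA = 000b (Register Reservation Key), key in NRKEY. */
   static int nvme_resv_register(int fd, uint32_t nsid, uint64_t nrkey)
   {
       return nvme_resv_cmd(fd, nsid, NVME_CMD_RESV_REGISTER,
                            0 /* RREGA = 000b */, 0, nrkey);
   }

   /* MDS side (2.2.4.2): preempt a non-responding client;
    * RACQA = 001b (Preempt), MDS key in CRKEY, client key in PRKEY. */
   static int nvme_resv_preempt(int fd, uint32_t nsid,
                                uint64_t mds_key, uint64_t client_key)
   {
       uint32_t cdw10 = 0x1 |                               /* RACQA */
           ((uint32_t)NVME_RESV_EXCL_ACCESS_REG_ONLY << 8); /* RTYPE */

       return nvme_resv_cmd(fd, nsid, NVME_CMD_RESV_ACQUIRE,
                            cdw10, mds_key, client_key);
   }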
2.2.5. Client Recovery after a Fence Action

   A client that detects an NVMe error status code (I/O error) from a
   storage device MUST commit all layouts that use the storage device
   through the MDS, return all outstanding layouts for the device,
   forget the device ID, and unregister the reservation key.

   Future GETDEVICEINFO calls MAY refer to the storage device again,
   in which case the client will perform a new registration based on
   the key provided.  If a reservation holder attempts to obtain a
   reservation of a different type on a namespace for which that host
   already is the reservation holder, then the command is aborted with
   a status of Reservation Conflict.  It is not an error if a
   reservation holder attempts to obtain a reservation of the same
   type on a namespace for which that host already is the reservation
   holder.

   NVMe over Fabrics [NVMEoF] utilizes the same controller
   architecture as that defined in the NVMe Base specification [NVME].
   This includes using Submission and Completion Queues to execute
   commands between a host and a controller.  Section 8.20 of the
   [NVME] Base specification describes the relationship between a
   controller (MDS) and a namespace associated with the Clients.  In
   the static controller model used by the SCSI layout, controllers
   that may be allocated to a particular Client may have different
   states at the time the association is established.

2.3. Volatile write caches

   The Volatile Write Cache Enable (WCE) bit (i.e., bit 00) in the
   Volatile Write Cache Feature (Feature Identifier 06h) is the Write
   Cache Enable field accessed with the NVMe Get Features command; see
   Section 5.21.1.6 of [NVME].  If a volatile write cache is enabled
   on an NVMe device used as a storage device for the pNFS SCSI
   layout, the MDS MUST use the NVMe FLUSH command to flush the
   volatile write cache.  If there is no volatile write cache on the
   device, then attempts to access this NVMe Feature cause errors: a
   Get Features command specifying the Volatile Write Cache feature
   identifier is expected to fail with a status of Invalid Field in
   Command.
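   The following non-normative C sketch shows how an MDS
   implementation might check the Volatile Write Cache feature and
   issue an NVMe FLUSH when the cache is enabled, again assuming the
   Linux NVMe passthrough ioctl interface.  The Get Features and Flush
   opcodes and the WCE bit position follow the authors' reading of
   [NVME]; the sketch is illustrative only.

   #include <stdint.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <linux/nvme_ioctl.h>

   #define NVME_ADMIN_GET_FEATURES  0x0a
   #define NVME_CMD_FLUSH           0x00
   #define NVME_FEAT_VOLATILE_WC    0x06

   /* Returns 1 if a volatile write cache is present and enabled, and
    * 0 if the feature is absent (Get Features fails, e.g. with
    * Invalid Field in Command) or the cache is disabled. */
   static int nvme_volatile_wc_enabled(int fd)
   {
       struct nvme_passthru_cmd cmd;

       memset(&cmd, 0, sizeof(cmd));
       cmd.opcode = NVME_ADMIN_GET_FEATURES;
       cmd.cdw10  = NVME_FEAT_VOLATILE_WC;  /* FID 06h, SEL = current */
       if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) != 0)
           return 0;                        /* no volatile write cache */
       return cmd.result & 0x1;             /* WCE bit */
   }

   /* Flush the volatile write cache of one namespace. */
   static int nvme_flush(int fd, uint32_t nsid)
   {
       struct nvme_passthru_cmd cmd;

       memset(&cmd, 0, sizeof(cmd));
       cmd.opcode = NVME_CMD_FLUSH;
       cmd.nsid   = nsid;
       return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
   }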
3. Security Considerations

   Since no protocol changes are proposed here, no new security
   considerations apply.  However, the protocol assumes that the NVMe
   Authentication commands are implemented as part of the NVMe
   Security Protocol, as the format of the data to be transferred
   depends on the Security Protocol.  Authentication Receive/Response
   commands return the data corresponding to an Authentication Send
   command, as defined by the rules of the Security Protocol.  As the
   current draft only supports the NVMe over Fabrics in-band protocol,
   the authentication requirements for security commands are based on
   the security protocol indicated by the SECP field in the command
   and DO NOT require authentication when used for NVMe in-band
   authentication.  When used for other purposes, in-band
   authentication of the commands is required.

4. IANA Considerations

   This document does not require any actions by IANA.

5. Normative References

   [NVME]     NVM Express, Inc., "NVM Express Revision 1.4",
              June 10, 2019.

   [NVMEoF]   NVM Express, Inc., "NVM Express over Fabrics
              Revision 1.1", July 26, 2019.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8154]  Hellwig, C., "Parallel NFS (pNFS) Small Computer System
              Interface (SCSI) Layout", RFC 8154, May 2017.

   [SBC3]     INCITS Technical Committee T10, "SCSI Block Commands-3",
              ANSI INCITS 514-2014, ISO/IEC 14776-323, 2014.

   [SPC4]     INCITS Technical Committee T10, "SCSI Primary
              Commands-4", ANSI INCITS 513-2015, 2015.

   [FC-NVMe]  INCITS Technical Committee T11, "Fibre Channel -
              Non-Volatile Memory Express", ANSI INCITS 540, 2018.

   [RFC7525]  Sheffer, Y., Holz, R., and P. Saint-Andre,
              "Recommendations for Secure Use of Transport Layer
              Security (TLS) and Datagram Transport Layer Security
              (DTLS)", BCP 195, RFC 7525, May 2015.

Authors' Addresses

   Sorin Faibish
   Dell EMC
   228 South Street
   Hopkinton, MA 01774
   United States of America

   Phone: +1 508-249-5745
   Email: faibish.sorin@dell.com

   David Black
   Dell EMC
   176 South Street
   Hopkinton, MA 01748
   United States of America

   Phone: +1 774-350-9323
   Email: david.black@dell.com

   Christoph Hellwig
   Email: hch@lst.de