idnits 2.17.1

draft-theoleyre-raw-oam-support-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (October 25, 2020) is 1251 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

     No issues found here.

     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

RAW                                                         F. Theoleyre
Internet-Draft                                                      CNRS
Intended status: Standards Track                         G. Papadopoulos
Expires: April 28, 2021                                   IMT Atlantique
                                                               G. Mirsky
                                                               ZTE Corp.
                                                        October 25, 2020

   Operations, Administration and Maintenance (OAM) features for RAW
                   draft-theoleyre-raw-oam-support-04

Abstract

   Some critical applications may use a wireless infrastructure.
   However, wireless networks exhibit a bandwidth several orders of
   magnitude lower than that of wired networks.
   Besides, wireless transmissions are lossy by nature; the probability
   that a packet cannot be decoded correctly by the receiver may be
   quite high. In these conditions, guaranteeing that the network
   infrastructure works properly is particularly challenging, since we
   need to address some issues specific to wireless networks. This
   document lists the requirements of the Operations, Administration,
   and Maintenance (OAM) features recommended to construct a predictable
   communication infrastructure on top of a collection of wireless
   segments. This document describes the benefits, problems, and
   trade-offs for using OAM in wireless networks to achieve Service
   Level Objectives (SLO).

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 28, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Acronyms
     1.3.  Requirements Language
   2.  Role of OAM in RAW
     2.1.  Link concept and quality
     2.2.  Broadcast Transmissions
     2.3.  Complex Layer 2 Forwarding
   3.  Operation
     3.1.  Information Collection
     3.2.  Continuity Check
     3.3.  Connectivity Verification
     3.4.  Route Tracing
     3.5.  Fault Verification/detection
     3.6.  Fault Isolation/identification
   4.  Administration
     4.1.  Worst-case metrics
     4.2.  Efficient data retrieval
   5.  Maintenance
     5.1.  Dynamic Resource Reservation
     5.2.  Reliable Reconfiguration
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgments
   9.  Informative References
   Authors' Addresses

1.  Introduction

   Reliable and Available Wireless (RAW) is an effort that extends
   DetNet to approach end-to-end deterministic performance over a
   network that includes scheduled wireless segments. In wired
   networks, many approaches try to enable Quality of Service (QoS) by
   implementing traffic differentiation so that routers handle each type
   of packet differently. However, this differentiated treatment was
   expensive for most applications.

   Deterministic Networking (DetNet) [RFC8655] has proposed to provide a
   bounded end-to-end latency on top of the network infrastructure,
   comprising both Layer 2 bridged and Layer 3 routed segments. This
   work encompasses the data plane, OAM, time synchronization,
   management, control, and security aspects.

   However, wireless networks create specific challenges. First of all,
   radio bandwidth is significantly lower than in wired networks. In
   these conditions, the volume of signaling messages has to be very
   limited. Even worse, wireless links are lossy: a Layer 2
   transmission may or may not be decoded correctly by the receiver,
   depending on a broad set of parameters. Thus, providing high
   reliability through wireless segments is particularly challenging.

   Wired networks rely on the concept of _links_. All the devices
   attached to a link receive any transmission. The concept of a link
   in wireless networks is somewhat different from what many are used to
   in wireline networks. A receiver may or may not receive a
   transmission, depending on the presence of a colliding transmission,
   the radio channel's quality, and the external interference. Besides,
   a wireless transmission is broadcast by nature: any _neighboring_
   device may be able to decode it. This document details the
   implications for the OAM features.

   Last but not least, radio links present volatile characteristics. If
   the wireless networks use an unlicensed band, packet losses are no
   longer temporally and spatially independent. Typically, links may
   exhibit very bursty losses, where several consecutive packets are
   dropped. Thus, providing availability and reliability on top of the
   wireless infrastructure requires specific Layer 3 mechanisms to
   counteract these bursty losses.

   Operations, Administration, and Maintenance (OAM) tools are of
   primary importance for IP networks [RFC7276]. OAM defines a toolset
   for fault detection, isolation, and performance measurement.

   The primary purpose of this document is to detail the specific
   requirements of the OAM features recommended to construct a
   predictable communication infrastructure on top of a collection of
   wireless segments. This document describes the benefits, problems,
   and trade-offs for using OAM in wireless networks to provide
   availability and predictability.

   In this document, the term OAM will be used according to its
   definition specified in [RFC6291]. We expect to implement an OAM
   framework in RAW networks to maintain a real-time view of the network
   infrastructure, and its ability to respect the Service Level
   Objectives (SLO), such as delay and reliability, assigned to each
   data flow.

1.1.  Terminology

   We reuse here the same terminology as [detnet-oam]:

   o  OAM entity: a data flow to be controlled;

   o  Maintenance End Point (MEP): OAM devices crossed when
      entering/exiting the network. In RAW, it corresponds mostly to
      the source or destination of a data flow.
      OAM messages can be exchanged between two MEPs;

   o  Maintenance Intermediate endPoint (MIP): OAM devices along the
      flow; OAM messages can be exchanged between a MEP and a MIP;

   o  control/data plane: while the control plane configures and
      controls the network (long-term), the data plane takes the
      individual decisions;

   o  passive / active methods (as defined in [RFC7799]): active methods
      send additional control information (inserting novel fields,
      generating novel control packets). Passive methods infer
      information just by observing unmodified existing flows.

   o  active methods may implement one of these two strategies:

      *  In-band: control information follows the same path as the data
         packets. In other words, a failure in the data plane may
         prevent the control information from reaching the destination
         (e.g., end-device or controller).

      *  Out-of-band: control information is sent separately from the
         data packets. Thus, the behavior of control vs. data packets
         may differ;

   We also adopt the following terminology, which is particularly
   relevant for RAW segments.

   o  piggybacking vs. dedicated control packets: control information
      may be encapsulated in specific (dedicated) control packets.
      Alternatively, it may be piggybacked in existing data packets,
      when the MTU is larger than the actual packet length.
      Piggybacking makes particular sense in wireless networks: the
      cost (bandwidth and energy) is not linear with the packet size.

   o  route-over vs. mesh-under: a control packet is either forwarded
      directly to the layer-3 next hop (mesh-under) or handled
      hop-by-hop by each router (route-over). While the latter option
      consumes more resources, it allows additional intermediate
      information to be collected, which is particularly relevant in
      wireless networks.

   o  Defect: a temporary change in the network (e.g., a radio link
      which is broken due to a mobile obstacle);

   o  Fault: a definitive change which may affect the network
      performance, e.g., a node runs out of energy.

1.2.  Acronyms

   OAM     Operations, Administration, and Maintenance

   DetNet  Deterministic Networking

   SLO     Service Level Objective

   QoS     Quality of Service

   SNMP    Simple Network Management Protocol

   SDN     Software-Defined Network

1.3.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Role of OAM in RAW

   RAW networks expect to make the communications reliable and
   predictable on top of a wireless network infrastructure. Most
   critical applications will define an SLO required for the data flows
   they generate. RAW considers network plane protocol elements such as
   OAM to improve the RAW operation at the service and the forwarding
   sub-layers.

   To respect strict guarantees, RAW relies on an orchestrator able to
   monitor and maintain the network. Typically, a Software-Defined
   Network (SDN) controller is in charge of scheduling the transmissions
   in the deployed network, based on the radio link characteristics, the
   SLO of the flows, and the number of packets to forward. Thus,
   resources have to be provisioned a priori to handle any defect. OAM
   represents the core of the pre-provisioning process and keeps the
   network operational by updating the schedule dynamically.

   Fault-tolerance also assumes that multiple paths have to be
   provisioned so that an end-to-end circuit continues to exist whatever
   the conditions.
   The Packet Replication and Elimination Function ([PREF-draft]) on a
   node is typically controlled by a central controller/orchestrator.
   OAM mechanisms can be used to monitor that PREOF is working correctly
   on a node and within the domain.

   To be energy-efficient, reserving some dedicated out-of-band
   resources for OAM seems idealistic, and only in-band solutions are
   considered here.

   RAW supports both proactive and on-demand troubleshooting.

   The specific characteristics of RAW are discussed below.

2.1.  Link concept and quality

   In wireless networks, a _link_ does not exist physically. A common
   convention is to define a wireless link as a pair of devices that
   have a non-zero probability of exchanging a packet that the receiver
   can decode. Similarly, we designate as *neighbor* any device with a
   radio link with a specific transmitter.

   Each wireless link is associated with a link quality, often measured
   as the Packet Delivery Ratio (PDR), i.e., the probability that the
   receiver can decode the packet correctly. It is worth noting that
   this link quality depends on many criteria, such as the level of
   external interference, the presence of concurrent transmissions, or
   the radio channel state. Moreover, this link quality is
   time-variant.

2.2.  Broadcast Transmissions

   In modern switching networks, a unicast transmission is delivered
   only to the destination. Wireless networks are much closer to the
   legacy *shared access* networks. In practice, unicast and broadcast
   frames are handled similarly at the physical layer. The link layer
   is just in charge of filtering the frames to discard irrelevant
   receptions (e.g., different unicast MAC address).

   However, contrary to wired networks, we cannot be sure that a packet
   is received by *all* the devices attached to the layer-2 segment.
   It depends on the radio channel state between the transmitter(s) and
   the receiver(s). In particular, concurrent transmissions may be
   possible or not, depending on the radio conditions (e.g., do the
   different transmitters use a different radio channel or are they
   sufficiently spatially separated?).

2.3.  Complex Layer 2 Forwarding

   Multiple neighbors may receive a transmission. Thus, anycast layer-2
   forwarding helps to maximize the reliability by assigning multiple
   receivers to a single transmission. That way, the packet is lost
   only if *none* of the receivers decode it. In practice, it has been
   shown that different neighbors may exhibit very different radio
   conditions, and that reception independence may hold for some of
   them [anycast-property].

3.  Operation

   OAM features will provide RAW with robust operation for both
   forwarding and routing purposes.

3.1.  Information Collection

   The model to exchange information should be the same as for DetNet
   networks, for the sake of interoperability. YANG may typically
   fulfill this objective.

   However, RAW networks imply specific constraints (e.g., low
   bandwidth, packet losses, cost of medium access) that may require
   minimizing the volume of information to collect. Thus, we discuss in
   Section 4.2 the different ways to collect information, i.e., to
   physically transfer the OAM information from the emitter to the
   receiver.

3.2.  Continuity Check

   Similarly to DetNet, we need to verify that the source and the
   destination are connected (at least one valid path exists).

3.3.  Connectivity Verification

   As in DetNet, we have to verify the absence of misconnection. We
   will focus here on the RAW specificities.

   Because of the broadcast nature of radio transmissions, several
   receivers may be active at the same time to enable anycast Layer 2
   forwarding. Thus, the connectivity verification must test any
   combination.
   We also consider priority-based mechanisms for anycast forwarding,
   i.e., all the receivers have different probabilities of forwarding a
   packet. To verify a delay SLO for a given flow, we must also
   consider all the possible combinations, leading to a probability
   distribution function for end-to-end transmissions. If this
   verification is implemented naively, the number of combinations to
   test may be exponential and too costly for wireless networks with low
   bandwidth.

3.4.  Route Tracing

   Wireless networks are meshed by nature: we have many redundant radio
   links. These meshed networks are both an asset and a drawback:
   several paths exist between two endpoints, but we have to choose the
   most efficient one(s), specifically with respect to reliability and
   delay.

   Thus, multipath routing can be considered to make the network
   fault-tolerant. Even better, we can leverage the broadcast nature of
   wireless networks to exploit meshed multipath routing: we may have
   multiple Maintenance Intermediate endPoints (MIPs) for each hop in
   the path. In that way, each MIP has several possible next hops in
   the forwarding plane. Thus, all the possible paths between two
   maintenance endpoints should be retrieved, which may quickly become
   intractable if we apply a naive approach.

3.5.  Fault Verification/detection

   Wired networks tend to present stable performance. On the contrary,
   wireless networks are time-variant. We must consequently make a
   distinction between _normal_ variations and malfunctions.

3.6.  Fault Isolation/identification

   The network has to isolate and identify the cause of the fault.
   While DetNet already expects to identify malfunctions, some problems
   are specific to wireless networks. We must consequently collect
   metrics and implement algorithms tailored for wireless networking.
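
   As a minimal, non-normative sketch of such an algorithm, a node could
   compare a short-term and a long-term Packet Delivery Ratio (PDR)
   estimate to separate normal variations, temporary defects, and
   lasting faults. The class name, window sizes, and threshold below
   are hypothetical and only illustrate the idea; they are not defined
   by this document.

```python
from collections import deque


class LinkMonitor:
    """Classify a link from a short and a long window of delivery outcomes.

    Window sizes and the PDR threshold are illustrative assumptions.
    """

    def __init__(self, short_window=10, long_window=100, pdr_threshold=0.5):
        # Each deque keeps only the most recent outcomes (True = delivered).
        self.short = deque(maxlen=short_window)
        self.long = deque(maxlen=long_window)
        self.pdr_threshold = pdr_threshold

    def record(self, delivered):
        """Record one transmission outcome (True if it was acknowledged)."""
        self.short.append(delivered)
        self.long.append(delivered)

    @staticmethod
    def _pdr(window):
        # With no observation yet, assume the link is fine.
        return sum(window) / len(window) if window else 1.0

    def state(self):
        """Return "normal", "defect" (recent burst), or "fault" (persistent)."""
        short_bad = self._pdr(self.short) < self.pdr_threshold
        long_bad = self._pdr(self.long) < self.pdr_threshold
        if short_bad and long_bad:
            return "fault"   # degradation persists over the long window
        if short_bad:
            return "defect"  # only the most recent transmissions are affected
        return "normal"
```

   A long-term estimate that also falls below the threshold suggests
   that the degradation is not just a short burst; identifying its
   cause still requires additional metrics.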

   For instance, the decrease in the link quality may be caused by
   several factors: external interference, obstacles, multipath fading,
   mobility. It is fundamental to be able to discriminate the different
   causes to make the right decision.

4.  Administration

   The RAW network has to expose a collection of metrics to support an
   operator making proper decisions, including:

   o  Packet losses: the time-window average and maximum values of the
      number of packet losses have to be measured. Many critical
      applications stop working if a few consecutive packets are
      dropped;

   o  Received Signal Strength Indicator (RSSI) is a very common metric
      in wireless to denote the link quality. The radio chipset is in
      charge of translating a received signal strength into a normalized
      quality indicator;

   o  Delay: the time elapsed between a packet generation / enqueuing
      and its reception by the next hop;

   o  Buffer occupancy: the number of packets present in the buffer, for
      each of the existing flows.

   These metrics should be collected per device, virtual circuit, and
   path, as DetNet already does. However, in RAW, we have to deal with
   a finer granularity:

   o  per radio channel to measure, e.g., the level of external
      interference, and to be able to apply counter-measures (e.g.,
      blacklisting).

   o  per link to detect misbehaving links (asymmetrical link,
      fluctuating quality).

   o  per resource block: a collision in the schedule is particularly
      challenging to identify in radio networks with spectrum reuse. In
      particular, a collision may not be systematic (depending on the
      radio characteristics and the traffic profile).

4.1.  Worst-case metrics

   RAW inherits the same requirements as DetNet: we need to know the
   distribution of a collection of metrics. However, wireless networks
   are known to be highly variable. Changes may be frequent, and may
   exhibit a periodic pattern.
   Collecting and analyzing this volume of measurements is challenging.

   Wireless networks are known to be lossy, and RAW has to implement
   strategies to improve reliability on top of unreliable links. Hybrid
   Automatic Repeat reQuest (ARQ) typically has to enable
   retransmissions based on the end-to-end reliability and latency
   requirements.

4.2.  Efficient data retrieval

   We have to minimize the number of statistics / measurements to
   exchange:

   o  energy efficiency: low-power devices have to limit the volume of
      monitoring information since every bit consumes energy.

   o  bandwidth: wireless networks exhibit a bandwidth significantly
      lower than wired, best-effort networks.

   o  per-packet cost: it is often more expensive to send several
      packets instead of combining them in a single link-layer frame.

   In conclusion, we have to take care of power and bandwidth
   consumption. The following techniques aim to reduce the cost of such
   maintenance:

   on-path collection: some control information is inserted in the
      data packets if it does not cause the packet to be fragmented
      (i.e., the MTU is not exceeded). Information Elements represent a
      standardized way to handle such information;

   flags/fields: we have to set flags in the monitored packets to be
      able to monitor the forwarding process accurately. A sequence
      number field may help to detect packet losses. Similarly, path
      inference tools such as [ipath] insert additional information in
      the headers to identify the path followed by a packet a
      posteriori.

   hierarchical monitoring: localized and centralized mechanisms have
      to be combined. Typically, a local mechanism should continuously
      monitor a set of metrics and trigger distant OAM exchanges only
      when a fault is detected (but possibly not identified). For
      instance, local temporary defects must not trigger expensive OAM
      transmissions.

5.  Maintenance

   RAW needs to implement a self-healing and self-optimization approach.
   The network must continuously retrieve the state of the network, to
   assess the relevance of a reconfiguration, quantifying:

   the cost of the sub-optimality: resources may not be used
      optimally (e.g., a better path exists);

   the reconfiguration cost: the controller needs to trigger some
      reconfigurations. For this transient period, resources may be
      reserved twice, and control packets have to be transmitted.

   Thus, reconfiguration may only be triggered if the gain is
   significant.

5.1.  Dynamic Resource Reservation

   Wireless networks exhibit time-variant characteristics. Thus, the
   network has to provide additional resources along the path to fit the
   worst-case performance. These time-variant characteristics make the
   resource reservation very challenging: over-reaction wastes radio and
   energy resources. Conversely, under-reaction jeopardizes the network
   operations, and some SLOs may be violated.

5.2.  Reliable Reconfiguration

   Wireless networks are known to be lossy. Thus, commands may or may
   not be received by the node to be reconfigured. Unfortunately,
   inconsistent states may create critical misconfigurations, where
   packets may be lost along a path because it has not been properly
   configured.

   We have to propose mechanisms to guarantee that the network state is
   always consistent, even if some control packets are lost. Timeouts
   and retransmissions are not sufficient since the reconfiguration
   duration would be, in that case, unbounded.

6.  IANA Considerations

   This document has no actionable requirements for IANA. This section
   can be removed before publication.

7.  Security Considerations

   This section will be expanded in future versions of the draft.

8.  Acknowledgments

   TBD

9.  Informative References

   [anycast-property]
              Teles Hermeto, R., Gallais, A., and F.
              Theoleyre, "Is Link-Layer Anycast Scheduling Relevant for
              IEEE 802.15.4-TSCH Networks?", 2019.

   [detnet-oam]
              Theoleyre, F., Papadopoulos, G. Z., Mirsky, G., and C. J.
              Bernardos, "Operations, Administration and Maintenance
              (OAM) features for detnet", 2020.

   [ipath]    Gao, Y., Dong, W., Chen, C., Bu, J., Wu, W., and X. Liu,
              "iPath: path inference in wireless sensor networks.",
              2016.

   [PREF-draft]
              Thubert, P., Eckert, T., Brodard, Z., and H. Jiang,
              "BIER-TE extensions for Packet Replication and Elimination
              Function (PREF) and OAM", 2018.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC6291]  Andersson, L., van Helvoort, H., Bonica, R., Romascanu,
              D., and S. Mansfield, "Guidelines for the Use of the "OAM"
              Acronym in the IETF", BCP 161, RFC 6291,
              DOI 10.17487/RFC6291, June 2011.

   [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
              Weingarten, "An Overview of Operations, Administration,
              and Maintenance (OAM) Tools", RFC 7276,
              DOI 10.17487/RFC7276, June 2014.

   [RFC7799]  Morton, A., "Active and Passive Metrics and Methods (with
              Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799,
              May 2016.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8655]  Finn, N., Thubert, P., Varga, B., and J. Farkas,
              "Deterministic Networking Architecture", RFC 8655,
              DOI 10.17487/RFC8655, October 2019.

Authors' Addresses

   Fabrice Theoleyre
   CNRS
   Building B
   300 boulevard Sebastien Brant - CS 10413
   Illkirch - Strasbourg  67400
   FRANCE

   Phone: +33 368 85 45 33
   Email: theoleyre@unistra.fr
   URI:   http://www.theoleyre.eu

   Georgios Z. Papadopoulos
   IMT Atlantique
   Office B00 - 102A
   2 Rue de la Chataigneraie
   Cesson-Sevigne - Rennes  35510
   FRANCE

   Phone: +33 299 12 70 04
   Email: georgios.papadopoulos@imt-atlantique.fr

   Greg Mirsky
   ZTE Corp.

   Email: gregimirsky@gmail.com