idnits 2.17.1

draft-theoleyre-raw-oam-support-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (April 11, 2020) is 1476 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

     No issues found here.

     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

RAW                                                         F. Theoleyre
Internet-Draft                                                      CNRS
Intended status: Standards Track                         G. Papadopoulos
Expires: October 13, 2020                                 IMT Atlantique
                                                               G. Mirsky
                                                               ZTE Corp.
                                                          April 11, 2020

   Operations, Administration and Maintenance (OAM) features for RAW
                   draft-theoleyre-raw-oam-support-02

Abstract

   Some critical applications may use a wireless infrastructure.
   However, wireless networks exhibit a bandwidth several orders of
   magnitude lower than wired networks.
   Besides, wireless transmissions are lossy by nature; the probability
   that a packet cannot be decoded correctly by the receiver may be
   quite high.  In these conditions, guaranteeing that the network
   infrastructure works properly is particularly challenging, since we
   need to address some issues specific to wireless networks.  This
   document lists the requirements of the Operations, Administration,
   and Maintenance (OAM) features recommended to construct a
   predictable communication infrastructure on top of a collection of
   wireless segments.  This document describes the benefits, problems,
   and trade-offs for using OAM in wireless networks to achieve Service
   Level Objectives (SLO).

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 13, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Acronyms
     1.3.  Requirements Language
   2.  Role of OAM in RAW
   3.  Operation
     3.1.  Information Collection
     3.2.  Continuity Check
     3.3.  Connectivity Verification
     3.4.  Route Tracing
     3.5.  Fault Verification/detection
     3.6.  Fault Isolation/identification
   4.  Administration
     4.1.  Collection of metrics
     4.2.  Worst-case metrics
     4.3.  Energy efficiency constraint
   5.  Maintenance
     5.1.  Multipath Routing
     5.2.  Replication / Elimination
     5.3.  Resource Reservation
     5.4.  Soft transition after reconfiguration
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgments
   9.  Informative References
   Authors' Addresses

1.  Introduction

   Reliable and Available Wireless (RAW) is an effort that extends
   DetNet to approach end-to-end deterministic performance over a
   network that includes scheduled wireless segments.  In wired
   networks, many approaches to Quality of Service (QoS) tried to
   implement traffic differentiation so that routers handle each type
   of packet differently.  However, this differentiated treatment was
   expensive for most applications.

   Deterministic Networking (DetNet) [RFC8655] has proposed to provide a
   bounded end-to-end latency on top of the network infrastructure,
   comprising both Layer 2 bridged and Layer 3 routed segments.  This
   work encompasses the data plane, OAM, time synchronization,
   management, control, and security aspects.

   However, wireless networks create specific challenges.  First of all,
   radio bandwidth is significantly lower than in wired networks.  In
   these conditions, the volume of signaling messages has to be very
   limited.  Even worse, wireless links are lossy: a Layer 2
   transmission may or may not be decoded correctly by the receiver,
   depending on a large set of parameters.  Thus, providing high
   reliability through wireless segments only is particularly
   challenging.

   Last but not least, radio links present very unstable
   characteristics.  If the wireless networks use an unlicensed band,
   packet losses are no longer temporally and spatially independent.
   Typically, links may exhibit a very bursty characteristic, where
   several consecutive packets may be dropped.  Thus, providing
   availability and reliability on top of the wireless infrastructure
   requires specific Layer 3 mechanisms to counteract these bursty
   losses.

   Operations, Administration, and Maintenance (OAM) tools are of
   primary importance for IP networks [RFC7276].
   OAM provides a toolset for fault detection and isolation, and for
   performance measurement.

   The main purpose of this document is to detail the specific
   requirements of the OAM features recommended to construct a
   predictable communication infrastructure on top of a collection of
   wireless segments.  This document describes the benefits, problems,
   and trade-offs for using OAM in wireless networks to provide
   availability and predictability.

   In this document, the term OAM will be used according to its
   definition specified in [RFC6291].  We expect to implement an OAM
   framework in RAW networks to maintain a real-time view of the network
   infrastructure, and its ability to respect the Service Level
   Objectives (SLO), such as delay and reliability, assigned to each
   data flow.

1.1.  Terminology

   o  OAM entity: a data flow to be controlled;

   o  Maintenance End Point (MEP): OAM devices crossed when entering/
      exiting the network.  In RAW, it corresponds mostly to the source
      or destination of a data flow.  OAM messages can be exchanged
      between two MEPs;

   o  Maintenance Intermediate end Point (MIP): OAM devices along the
      flow; OAM messages can be exchanged between a MEP and a MIP;

   o  Defect: a temporary change in the network (e.g., a radio link
      which is broken due to a mobile obstacle);

   o  Fault: a permanent change which may affect the network
      performance, e.g., a node runs out of energy.

1.2.  Acronyms

   OAM     Operations, Administration, and Maintenance

   DetNet  Deterministic Networking

   SLO     Service Level Objective

   QoS     Quality of Service

   SNMP    Simple Network Management Protocol

   SDN     Software Defined Network

1.3.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Role of OAM in RAW

   RAW networks aim to make communications reliable and predictable on
   top of a wireless network infrastructure.  Most critical
   applications will define an SLO required for the data flows they
   generate.  RAW considers network plane protocol elements such as OAM
   to improve the RAW operation at the service and the forwarding
   sub-layers.

   To respect strict guarantees, RAW relies on an orchestrator able to
   monitor and maintain the network.  Typically, a Software Defined
   Network (SDN) controller is in charge of scheduling the transmissions
   in the deployed network, based on the radio link characteristics,
   the SLO of the flows, and the number of packets to forward.  Thus,
   resources have to be provisioned a priori to handle any defect.  OAM
   represents the core of the over-provisioning process and keeps the
   network operational by updating the schedule dynamically.

   Fault-tolerance also assumes that multiple paths have to be
   provisioned so that an end-to-end circuit remains available whatever
   the conditions.  The replication/elimination process (PREOF) on a
   node is typically controlled by the central controller/orchestrator.
   OAM is in charge of verifying that PREOF is working properly on a
   node and within the domain.

   Because energy efficiency is a concern, reserving dedicated
   out-of-band resources for OAM seems unrealistic, and only in-band
   solutions are considered here.

   RAW supports both proactive and on-demand troubleshooting.

3.  Operation

   OAM features will provide RAW with robust operation for both
   forwarding and routing purposes.

3.1.  Information Collection

   Several solutions (e.g., Simple Network Management Protocol (SNMP),
   YANG-based data models) are already available to collect statistics.
   These statistics can be encapsulated in specific monitoring packets
   and sent to the controller.

3.2.  Continuity Check

   We need to verify that two endpoints are connected.  In other words,
   at least one way exists to deliver the packets between two endpoints
   A and B.

3.3.  Connectivity Verification

   In addition to the Continuity Check, we have to verify the
   connectivity.  This verification considers additional constraints,
   i.e., the absence of misconnection.

   In particular, we have to verify that the resources are reserved for
   a given flow, that no packets from other flows steal the
   corresponding resources, and that the destination does not receive
   packets from different flows through its interface.

   It is worth noting that the control and data packets may not follow
   the same path, and the connectivity verification has to be conducted
   in-band without impacting the data traffic.  Test packets must share
   fate with the monitored data traffic without introducing congestion
   in normal network conditions.

3.4.  Route Tracing

   Ping and traceroute are two very common diagnostic tools.  They help
   to identify a subset of the list of routers in the route.  However,
   to be predictable, resources are reserved per flow in RAW.  Thus, we
   need to define route tracing tools able to track the route for a
   specific flow.

   Wireless networks are meshed by nature: we have many redundant radio
   links.
   These meshed networks are both an asset and a drawback: while
   several paths exist between two endpoints, we should choose the most
   efficient one(s), specifically with respect to reliability and
   delay.

   Thus, multipath routing can be considered to make the network fault-
   tolerant.  Even better, we can exploit the broadcast nature of
   wireless networks to enable meshed multipath routing: we may have
   multiple Maintenance Intermediate Endpoints for each hop in the path.
   In that way, each Maintenance Intermediate Endpoint has several
   possible next hops in the forwarding plane.  Thus, all the possible
   paths between two maintenance endpoints should be retrieved.

3.5.  Fault Verification/detection

   RAW expects to operate fault-tolerant networks.  Thus, we need
   mechanisms able to detect faults before they impact the network
   performance.

   The network has to detect when a fault has occurred, i.e., the
   network has deviated from its expected behavior.  While the network
   must report an alarm, the cause may not be identified precisely.  For
   instance, the end-to-end reliability has decreased significantly, or
   a buffer overflow occurs.

3.6.  Fault Isolation/identification

   The network has to isolate and identify the cause of the fault.  For
   instance, the quality of a specific link has decreased, requiring
   more retransmissions, or the level of external interference has
   locally increased.

4.  Administration

   The network has to expose a collection of metrics to help an
   operator make proper decisions, including:

   o  Packet losses: the time-window average and maximum values of the
      number of packet losses have to be measured.  Many critical
      applications stop working if a few consecutive packets are
      dropped;

   o  Received Signal Strength Indicator (RSSI) is a very common metric
      in wireless networks to denote the link quality.  The radio
      chipset is in charge of translating a received signal strength
      into a normalized quality indicator;

   o  Delay: the time elapsed between the generation or enqueuing of a
      packet and its reception by the next hop;

   o  Buffer occupancy: the number of packets present in the buffer, for
      each of the existing flows.

   These metrics should be collected:

   o  per virtual circuit to measure the end-to-end performance for a
      given flow.  Each of the paths has to be isolated in multipath
      routing strategies;

   o  per radio channel to measure, e.g., the level of external
      interference, and to be able to apply counter-measures (e.g.,
      blacklisting);

   o  per device to detect a misbehaving node, when it relays the
      packets of several flows.

4.1.  Collection of metrics

   We have to minimize the number of statistics / measurements to
   exchange:

   o  energy efficiency: low-power devices have to limit the volume of
      monitoring information since every bit consumes energy.

   o  bandwidth: wireless networks exhibit a bandwidth significantly
      lower than wired, best-effort networks.

   o  per-packet cost: it is often more expensive to send several
      packets instead of combining them in a single link-layer frame.

   Thus, localized and centralized mechanisms have to be combined, and
   additional control packets have to be triggered only after a fault
   detection.

4.2.  Worst-case metrics

   RAW aims to enable real-time communications on top of a heterogeneous
   architecture.  Since wireless networks are known to be lossy, RAW has
   to implement strategies to improve reliability on top of unreliable
   links.  Hybrid Automatic Repeat reQuest (ARQ) typically has to enable
   retransmissions based on the end-to-end reliability and latency
   requirements.

   To make correct decisions, the controller needs to know the
   distribution of packet losses for each flow and each hop of the
   paths.
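
   As an illustration of why the distribution of losses matters, the
   following minimal Python sketch compares two delivery traces for one
   hop.  The function name loss_statistics and both traces are
   hypothetical and are not defined by this document; the sketch only
   assumes that a node can log, per packet, whether the transmission
   was received.

```python
from typing import Dict, List


def loss_statistics(received: List[bool]) -> Dict[str, float]:
    """Summarize one hop's delivery trace (True = received, False = lost).

    Hypothetical helper, not part of any RAW specification.
    """
    total = len(received)
    lost = sum(1 for ok in received if not ok)

    # Track the longest run of consecutive losses (the worst burst).
    longest_burst = 0
    current_burst = 0
    for ok in received:
        if ok:
            current_burst = 0
        else:
            current_burst += 1
            longest_burst = max(longest_burst, current_burst)

    return {
        "loss_ratio": lost / total if total else 0.0,
        "max_consecutive_losses": longest_burst,
    }


# Two hypothetical traces with the same average loss ratio.
scattered = [True, False, True, True, True, True, False, True, True, True]
bursty = [True, True, True, True, False, False, True, True, True, True]

for name, trace in (("scattered", scattered), ("bursty", bursty)):
    print(name, loss_statistics(trace))
```

   Both traces have the same loss ratio, but the second one loses two
   packets in a row, which an application sensitive to consecutive
   losses would experience very differently.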
   In other words, the average end-to-end statistics are not enough.
   The statistics must allow the controller to predict the worst case.

4.3.  Energy efficiency constraint

   RAW also targets low-power wireless networks, where energy represents
   a key constraint.  Thus, we have to take care of power and bandwidth
   consumption.  The following techniques aim to reduce the cost of such
   maintenance:

   piggybacking:  some control information is inserted in the data
      packets if it does not cause fragmentation (i.e., the MTU is not
      exceeded).  Information Elements represent a standardized way to
      handle such information;

   flags/fields:  we have to set flags in the monitored packets to be
      able to monitor the forwarding process accurately.  A sequence
      number field may help to detect packet losses.  Similarly, path
      inference tools such as [ipath] insert additional information in
      the headers to identify the path followed by a packet a
      posteriori.

5.  Maintenance

   RAW needs to implement a self-healing and self-optimization approach.
   The network must continuously retrieve the state of the network, to
   judge the relevance of a reconfiguration, quantifying:

   the cost of the sub-optimality:  resources may not be used
      optimally (e.g., a better path exists);

   the reconfiguration cost:  the controller needs to trigger some
      reconfigurations.  For this transient period, resources may be
      reserved twice, and control packets have to be transmitted.

   Thus, reconfiguration may only be triggered if the gain is
   significant.

5.1.  Multipath Routing

   To be fault-tolerant, several paths can be reserved between two
   maintenance endpoints.  They must be node-disjoint so that a path can
   be available at any time.

5.2.  Replication / Elimination

   When multiple paths are reserved between two maintenance endpoints,
   the endpoints may decide to replicate the packets to introduce
   redundancy, and thus to alleviate transmission errors and
   collisions.  For instance, in Figure 1, the source node S is
   transmitting the packet to both parents, nodes A and B.  Each
   maintenance endpoint will decide to trigger the
   replication/elimination process when a set of metrics crosses a
   threshold value.

                      ===> (A) => (C) => (E) ===
                    //        \\//   \\//       \\
          source (S)          //\\   //\\         (R) (root)
                    \\       //  \\ //  \\      //
                      ===> (B) => (D) => (F) ===

   Figure 1: Packet Replication: S transmits the same data packet
             twice, to its DP (A) and to its AP (B).

5.3.  Resource Reservation

   Because the QoS criteria associated with a path may degrade, the
   network has to provision additional resources along the path.  We
   need to provide mechanisms to patch a schedule (changing the channel
   offset, allocating more timeslots, changing the path, etc.).

5.4.  Soft transition after reconfiguration

   Since RAW expects to support real-time flows, we have to support
   soft-reconfiguration, where the new resources are reserved before
   the old ones are released.  Some mechanisms have to be proposed so
   that packets are forwarded through the new track only when the
   resources are ready to be used, while maintaining the global state
   consistent (no packet reordering, duplication, etc.).

6.  IANA Considerations

   This document has no actionable requirements for IANA.  This section
   can be removed before publication.

7.  Security Considerations

   This section will be expanded in future versions of the draft.

8.  Acknowledgments

   TBD

9.  Informative References

   [ipath]    Gao, Y., Dong, W., Chen, C., Bu, J., Wu, W., and X. Liu,
              "iPath: path inference in wireless sensor networks.",
              2016.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC6291]  Andersson, L., van Helvoort, H., Bonica, R., Romascanu,
              D., and S. Mansfield, "Guidelines for the Use of the "OAM"
              Acronym in the IETF", BCP 161, RFC 6291,
              DOI 10.17487/RFC6291, June 2011.

   [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
              Weingarten, "An Overview of Operations, Administration,
              and Maintenance (OAM) Tools", RFC 7276,
              DOI 10.17487/RFC7276, June 2014.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8655]  Finn, N., Thubert, P., Varga, B., and J. Farkas,
              "Deterministic Networking Architecture", RFC 8655,
              DOI 10.17487/RFC8655, October 2019.

Authors' Addresses

   Fabrice Theoleyre
   CNRS
   Building B
   300 boulevard Sebastien Brant - CS 10413
   Illkirch - Strasbourg  67400
   FRANCE

   Phone: +33 368 85 45 33
   Email: theoleyre@unistra.fr
   URI:   http://www.theoleyre.eu

   Georgios Z. Papadopoulos
   IMT Atlantique
   Office B00 - 102A
   2 Rue de la Chataigneraie
   Cesson-Sevigne - Rennes  35510
   FRANCE

   Phone: +33 299 12 70 04
   Email: georgios.papadopoulos@imt-atlantique.fr

   Greg Mirsky
   ZTE Corp.

   Email: gregimirsky@gmail.com