<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="no"?>
<?rfc subcompact="no"?>
<?rfc authorship="yes"?>
<?rfc tocappendix="yes"?>

<rfc category="std" ipr="trust200902" docName="draft-theoleyre-raw-oam-support-02">


<front> 

    <title abbrev="OAM features for RAW">Operations, Administration and Maintenance (OAM) features for RAW</title>
   
 
 
	<author initials="F." surname="Theoleyre" fullname="Fabrice Theoleyre">
    	<organization>CNRS</organization>
     	<address>
        	<postal>
            	<street>Building B</street>
             	<street>300 boulevard Sebastien Brant - CS 10413</street>
             	<city>Illkirch - Strasbourg</city>
             	<code>67400</code>
             	<country>FRANCE</country>
         	</postal>
         	<phone>+33 368 85 45 33</phone>
            <email>theoleyre@unistra.fr</email>
            <uri>http://www.theoleyre.eu</uri>
     	</address>
 	</author>

    <author initials="G.Z." surname="Papadopoulos" fullname="Georgios Z. Papadopoulos">
      <organization>IMT Atlantique</organization>
      <address>         
        <postal>            
           <street>Office B00 - 102A</street>
           <street>2 Rue de la Ch&acirc;taigneraie</street>
           <city>Cesson-S&eacute;vign&eacute; - Rennes</city>
            <code>35510</code>
            <country>FRANCE</country>
         </postal>
         <phone>+33 299 12 70 04</phone>
    <email>georgios.papadopoulos@imt-atlantique.fr</email>
      </address>
    </author>

    <author initials="G." surname="Mirsky" fullname="Greg Mirsky">
      <organization>ZTE Corp.</organization>
      <address>
          <email>gregimirsky@gmail.com</email>
      </address>
    </author>

   <date/>

   <workgroup>RAW</workgroup>

   <abstract>
   		<t>
            Some critical applications may use a wireless infrastructure.
              However, wireless networks exhibit a bandwidth several orders of magnitude lower than that of wired networks.
              Besides, wireless transmissions are lossy by nature; the probability that a packet cannot be decoded correctly by the receiver may be quite high.
              In these conditions, guaranteeing that the network infrastructure works properly is particularly challenging, since issues specific to wireless networks have to be addressed.
              This document lists the requirements of the Operations, Administration, and Maintenance (OAM) features recommended to construct a predictable communication
              infrastructure on top of a collection of wireless segments.
              This document describes the benefits, problems, and trade-offs for using OAM in wireless networks to achieve Service Level Objectives (SLO).

   		</t>
   </abstract>
   

</front>

<middle>

        

<!--section title="TEMPORARY EDITORIAL NOTES">

   <t>
		This document is an Internet Draft, so it is work-in-progress by nature.
   		It contains the following work-in-progress elements:
   </t>	
     		 
   <t>
   	<list style="symbols">
		<t>"TODO" statements are elements which have not yet been written by the authors 
		for some reason (lack of time, ongoing discussions with no clear consensus, etc).
		The statement does indicate that the text will be written at some time.</t>
   	</list>
   </t>
   		 
</section --> 



<section title="Introduction">
	<t>
        Reliable and Available Wireless (RAW) is an effort that extends DetNet
        to approach end-to-end deterministic performance over a network that
        includes scheduled wireless segments.
        In wired networks, many approaches to Quality of Service (QoS) tried
          to implement traffic differentiation so that routers handle each type of packet differently.
        However, this differentiated treatment was expensive for most applications.
    </t>
    
    <t>
        Deterministic Networking (DetNet) <xref target="RFC8655"/> has proposed to provide a bounded end-to-end latency
        on top of the network infrastructure, comprising both Layer 2 bridged and Layer 3 routed segments.
        The DetNet work encompasses the data plane, OAM, time synchronization, management, control, and security aspects.
    </t>
    
    <t>
        However, wireless networks create specific challenges.
        First of all, radio bandwidth is significantly lower than for wired networks.
        In these conditions, the volume of signaling messages has to be very limited.
        Even worse, wireless links are lossy: a layer 2 transmission may or may not be 
        decoded correctly by the receiver, depending on a large set of parameters.
        Thus, providing high reliability through wireless segments only is particularly challenging.
    </t>
    
    <t>
        Last but not least, radio links present very unstable characteristics.
        If the wireless networks use an unlicensed band, packet losses are no longer temporally
        and spatially independent.
        Typically, links may exhibit a very bursty characteristic, where several 
        consecutive packets may be dropped.
        Thus, providing availability and reliability on top of the wireless infrastructure 
        requires specific layer 3 mechanisms to counteract these bursty losses. 
    </t>
	   
    <t>
        Operations, Administration, and Maintenance (OAM) tools are of primary importance
        for IP networks <xref target="RFC7276"/>.
        They provide a toolset for fault detection and isolation, and for performance measurement.
    </t>
       
    <t>
        The main purpose of this document is to detail the specific requirements of the OAM
        features recommended to construct a predictable communication infrastructure
        on top of a collection of wireless segments.
      	This document describes the benefits, problems, and trade-offs for using OAM in 
    	wireless networks to provide availability and predictability.
	</t>
    
   <t>
      	In this document, the term OAM will be used according to its definition specified 
      	in <xref target="RFC6291"/>.
      	We expect to implement an OAM framework in RAW networks to maintain a real-time 
      	view of the network infrastructure, and its ability to respect the Service Level
          Objectives (SLO), such as delay and reliability, assigned to each data flow.
   </t>
 
 
	<section title="Terminology">
    	<t>
        	<list style="symbols">
            	<t>
               		OAM entity: a data flow to be controlled;
            	</t>
                
                <!-- TODO-Greg-
                
                I think it is useful to differentiate between MEP and MIP. But we need to consider the origin of the term is Layer 2. For IETF we might choose/define somewhat different terms, e.g., Test Point for MEP and Monitor Point for MIP.
                -->
            
            	<t>
               		Maintenance End Point (MEP): OAM devices crossed when entering/exiting
                    the network. In RAW, it corresponds mostly to the source or destination
                    of a data flow. OAM messages can be exchanged between two MEPs;
            	</t>
         
                <t>
                    Maintenance Intermediate Point (MIP): OAM devices along the flow;
                    OAM messages can be exchanged between a MEP and a MIP;
                </t>

                <t>
            		Defect: a temporary change in the network (e.g. a radio link which is
                    broken due to a mobile obstacle);
         		</t>
         
         		<t>
            		Fault: a definite change which may affect the network performance,
            		e.g. a node runs out of energy.
            	</t>
         	</list>
    	</t>
	</section>
    
    
    <section anchor="acronyms-sec" title="Acronyms">
    <t>OAM                   Operations, Administration, and Maintenance</t>
    <t>DetNet               Deterministic Networking</t>
    <t>SLO                    Service Level Objective</t>
    <t>QoS                   Quality of Service</t>
    <t>SNMP               Simple Network Management Protocol</t>
    <t>SDN                  Software Defined Network</t>
    </section>

            <section title="Requirements Language">
             <t>
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
   "MAY", and "OPTIONAL" in this document are to be interpreted as
   described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/>
   when, and only when, they appear in all capitals, as shown here.
             </t>
             </section>
 

</section>





<!-- INTEREST OF OAM          -->
<section title="Role of OAM in RAW">
	<t>
      RAW networks aim to make communications reliable and predictable on top of a
      wireless network infrastructure. 
      Most critical applications will define an SLO to be required for the data flows they
      generate.
      RAW considers network plane protocol elements such as OAM to improve the RAW operation 
      at the service and the forwarding sub-layers.
	</t>


	<t>
      To respect strict guarantees, RAW relies on an orchestrator able to monitor and
      maintain the network. Typically, a Software Defined Network (SDN) controller is in charge of scheduling the transmissions
      in the deployed network, based on the radio link characteristics, the SLO of the flows, and the
      number of packets to forward.
      Thus, resources have to be provisioned a priori to handle any defect. 
      OAM represents the core of the overprovisioning process, and keeps the network 
      operational by updating the schedule dynamically.
	</t>

<!-- TODO-Greg-
Strictly speaking, PREOF also includes an order preservation. Do you see it as the required element in RAW?
-->

	<t>
      Fault tolerance also assumes that multiple paths have to be provisioned
      so that an end-to-end circuit keeps on existing whatever the conditions.
      The Packet Replication, Elimination, and Ordering Functions (PREOF) on a node are typically controlled by the
      central controller/orchestrator.
      OAM is in charge of verifying that PREOF is working properly on a node and within the domain.
	</t>

<!-- TODO-Greg-
I think that out-of-band OAM is useful in case of passive OAM that collects counters, statistics from the nodes, e.g., SNMP or Netconf/Restconf using YANG-based data models. Such methods may be used as systems have control and management channels already.
-->
	<t>
      For energy-efficiency reasons, reserving dedicated out-of-band resources
      for OAM seems unrealistic; only in-band solutions are considered here.
	</t>


	<t>
       RAW supports both proactive and on-demand troubleshooting.
	</t>
</section>




<!--  OPERATION          -->
<section title="Operation">
	<t>
    	OAM features will provide RAW with robust operation for both forwarding and routing 
    	purposes.
	</t>
    
    <!-- TODO-Greg-
    We can clarify that this is a method of passive OAM (according to RFC 7799 classification of OAM performance measurement methods).
    -->

    <section title="Information Collection">
           <t>
               Several solutions (e.g., Simple Network Management Protocol (SNMP), YANG-based data models) are already in charge of collecting the statistics.
               These statistics can then be encapsulated in specific monitoring packets and sent to the controller.
               </t>
           </section>

    
    
    <section title="Continuity Check">
		<t>
			We need to verify that two endpoints are connected.
            In other words, there exists at least one way to deliver the packets between
            two endpoints A and B.
 		</t>
        
        </section>

        
        <section title="Connectivity Verification">

		<t>
            In addition to the Continuity Check, we have to verify the connectivity.
            This verification considers additional constraints, i.e., the absence of
            misconnection.
        </t>
        <t>
            In particular, the resources have to be reserved for a given flow, and no
             packet from another flow may steal the corresponding resources. Similarly, the
             destination should not receive packets from different flows through its interface.
            </t>
     
    
        <t>
			It is worth noting that the control and data packets may not follow the same
            path, and the connectivity verification has to be conducted in-band without
            impacting the data traffic.
            <!-- TODO-Greg-
            It is critical for both CC, CV and, in general, active OAM methods. I think it can be listed in the Requirements for OAM in RAW section.
            -->
            Test packets must share fate with the monitored data traffic without introducing congestion under normal network conditions.
		</t>


	</section>
        
        
   
    <section title="Route Tracing">
		<t>
            <!-- TODO-Greg-
            Do we want to point to ICMP Ping specifically? Principles used in ICMP Ping/traceroute have been used in other networks, for example, MPLS LSP ping.
            -->
       		Ping and traceroute are two very common diagnostic tools. 
       		<!-- GP: we may check with Pascal, if he agrees, we may introduce the concepts of Bitmap-->
       		They help to identify a subset of the list of routers in the route.
       		However, to be predictable, resources are reserved per flow in RAW. 
       		Thus, we need to define route tracing tools able to track the route for a 
       		specific flow.
		</t>


		<t>
            Wireless networks are meshed by nature: we have many redundant radio links.
            These meshed networks are both an asset and a drawback: while several paths exist
            between two endpoints, we should choose the most efficient one(s), specifically
            with respect to reliability and delay.
           </t>
        
        <t>
            Thus, multipath routing can be considered to make the network fault-tolerant.
            Even better, we can exploit the broadcast nature of wireless networks to enable
            meshed multipath routing: we may have multiple Maintenance Intermediate Points (MIPs) for each
            hop in the path.
            In that way, each MIP has several possible next hops
            in the forwarding plane.
			Thus, all the possible paths between two Maintenance End Points should be 
			retrieved.
        </t>
    </section>

   
	<section title="Fault Verification/detection">
		<t>
      		RAW expects to operate fault-tolerant networks. 
      		Thus, we need mechanisms able to detect faults, before they impact the network 
      		performance.
		</t>


		<t>
     		The network has to detect when a fault has occurred, i.e., when the network has deviated
             from its expected behavior.
     		While the network must report an alarm, the cause may not be identified 
     		precisely. 
     		For instance, the alarm may report that the end-to-end reliability has decreased significantly, or that a 
     		buffer overflow has occurred.
		</t>

	</section>


	<section title="Fault Isolation/identification">
    	<t>
        	The network has to isolate and identify the cause of the fault. 
        	For instance, the quality of a specific link has decreased, requiring more 
        	retransmissions, or the level of external interference has locally increased.
		</t>
	</section>
</section>



<!--    ADMINISTRATION          -->
<section title="Administration">
   
   <t>
      The network has to expose a collection of metrics to help an operator make proper decisions, including:
      
      <list style="symbols">
         <t>
            Packet losses: the time-window average and maximum values of the
            number of packet losses have to be measured. Many critical applications
            stop working if a few consecutive packets are dropped;
            </t>
         
         <t>
            Received Signal Strength Indicator (RSSI): a very common metric
            in wireless networks to denote the link quality. The radio chipset is in charge
            of translating a received signal strength into a normalized
            quality indicator;
            </t>
         
         <t>
            Delay: the time elapsed between a packet generation / enqueuing
            and its reception by the next hop;
            </t>
         
         <t>
            Buffer occupancy: the number of packets present in the buffer,
            for each of the existing flows.
            </t>
         
      </list>
      </t>
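   <t>
      As an illustration of the packet-loss metric above, the following Python sketch keeps a
      sliding window over the most recent packets of one flow and reports both the loss ratio and
      the longest run of consecutive losses in that window. The class name LossWindow, its
      methods, and the window size are hypothetical and chosen for illustration only; they are
      not defined by RAW or by any OAM protocol.
   </t>
   <figure><artwork><![CDATA[
```python
from collections import deque


class LossWindow:
    """Sliding window over the last `size` packets of one flow.

    Hypothetical helper, for illustration only.
    """

    def __init__(self, size):
        # True = packet received, False = packet lost
        self.outcomes = deque(maxlen=size)

    def record(self, received):
        """Append the outcome of one packet; old entries fall out."""
        self.outcomes.append(bool(received))

    def loss_ratio(self):
        """Fraction of lost packets in the current window."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def longest_loss_run(self):
        """Longest sequence of consecutive losses in the window."""
        longest = current = 0
        for received in self.outcomes:
            current = 0 if received else current + 1
            longest = max(longest, current)
        return longest
```
]]></artwork></figure>
   <t>
      Reporting the longest loss run next to the ratio reflects the remark above that a few
      consecutive drops may be enough to stop a critical application, even when the average
      loss remains low.
   </t>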
   <t>
      These metrics should be collected:
      <list style="symbols">
         <t>
            per virtual circuit to measure the end-to-end performance for a given flow.
            Each of the paths has to be isolated in multipath routing strategies;
            </t>
         
         <t>per radio channel to measure, e.g., the level of external interference,
            and to be able to apply countermeasures (e.g., blacklisting);
            </t>
         
         <t>
            per device to detect a misbehaving node when it relays the packets of
            several flows.
            </t>
         </list>
   </t>
   
   
    <section title="Collection of metrics">
   
   
         <t>
             We have to minimize the number of statistics / measurements to exchange:
             <list style="symbols">
             <t>
                 energy efficiency: low-power devices have to limit the volume of monitoring
                 information since every bit consumes energy.
             </t>
             <t>
                 bandwidth: wireless networks exhibit a bandwidth significantly lower than
                 wired, best-effort networks.
             </t>
             <t>
                 per-packet cost: it is often more expensive to send several packets instead
                 of combining them in a single link-layer frame.
             </t>
             </list>
       
             Thus, localized and centralized mechanisms have to be combined, and
             additional control packets have to be triggered only after a fault detection.
         </t>
         
    <!-- TODO     <t>
             TODO:  be expanded to how active OAM and on-path OAM (for example iOAM, iPath, INT from P4) can be combined, complement each other.
             
             </t>
-->
       </section>
         
   <section title="Worst-case metrics">
      <t>
         RAW aims to enable real-time communications on top of a heterogeneous architecture.
         Since wireless networks are known to be lossy, RAW has to implement strategies to improve reliability on top of unreliable links.
         Hybrid Automatic Repeat reQuest (ARQ) typically has to enable retransmissions
         based on the end-to-end reliability and latency requirements.
      </t>
      
      <t>
         To make correct decisions, the controller needs to know the distribution
         of packet losses for each flow and for each hop of the paths.
         In other words, average end-to-end statistics are not enough:
         the collected statistics must allow the controller to predict the worst case.
         </t>
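      <t>
         To illustrate why averages are not sufficient, the following Python sketch extracts the
         lengths of loss bursts from a per-packet trace of one hop and returns a high quantile of
         that distribution. Two hops with the same average loss can then be told apart by their
         burstiness. The function names, the trace format (True for a received packet, False for
         a lost one), and the choice of a quantile as the worst-case indicator are assumptions
         made for this example only.
      </t>
      <figure><artwork><![CDATA[
```python
import math


def burst_lengths(trace):
    """Lengths of consecutive-loss runs (True = received)."""
    bursts, current = [], 0
    for received in trace:
        if received:
            if current:
                bursts.append(current)
            current = 0
        else:
            current += 1
    if current:
        # The trace ended in the middle of a burst.
        bursts.append(current)
    return bursts


def burst_quantile(bursts, q):
    """Smallest length b such that a fraction q of bursts is <= b."""
    if not bursts:
        return 0
    ordered = sorted(bursts)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Two hypothetical hops, both losing 4 packets out of 8.
hop_a = [True, False] * 4            # isolated losses
hop_b = [True] * 4 + [False] * 4     # one long burst
```
]]></artwork></figure>
      <t>
         In this example both hops have the same average loss, yet the first one only exhibits
         isolated losses while the second one drops several packets in a row, which calls for a
         different retransmission or replication strategy.
      </t>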
      
   </section>
   
   <section title="Energy efficiency constraint">
      <t>
         RAW also targets low-power wireless networks, where energy represents a key
         constraint.
         Thus, we have to take care of power and bandwidth consumption.
         The following techniques aim to reduce the cost of such maintenance:
         <list>
            <t>
               piggybacking: some control information is inserted in the data packets
               if it does not cause the packet to be fragmented (i.e., the MTU is not exceeded).
               Information Elements represent a standardized way to handle such information;
               </t>
            
            <t>
               flags/fields: we have to set up flags in the packets to be able
               to monitor the forwarding process accurately.
               A sequence number field may help to detect packet losses.
               Similarly, path inference tools such as <xref target="ipath"/> insert
               additional information in the headers to identify the path followed by a
               packet a posteriori.
               </t>
            </list>
         
         </t>
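      <t>
         The use of a sequence number field to detect packet losses can be sketched as follows in
         Python. The function name, the 8-bit counter size, and the assumptions that packets are
         not reordered and that fewer consecutive packets are lost than the counter can hold are
         all hypothetical and serve the example only.
      </t>
      <figure><artwork><![CDATA[
```python
def missing_sequence_numbers(received, modulo=256):
    """Return the sequence numbers skipped between arrivals.

    `received` lists sequence numbers in arrival order. The counter
    wraps at `modulo` (assumed 8-bit here). Assumes no reordering and
    fewer than `modulo` consecutive losses.
    """
    missing = []
    for previous, current in zip(received, received[1:]):
        # Distance between two arrivals, robust to wrap-around.
        gap = (current - previous) % modulo
        for offset in range(1, gap):
            missing.append((previous + offset) % modulo)
    return missing
```
]]></artwork></figure>
      <t>
         Because the check only relies on a field already carried by the data packets, it adds
         no control traffic, which matches the energy constraint discussed in this section.
      </t>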
      
      </section>
   
   </section>
   
   
   
   
   
   
   


<!--   MAINTENANCE          -->

<!-- speak about predictive maintenance? -->

   <section title="Maintenance">
      <t>
         RAW needs to implement a self-healing and self-optimization approach.
         The state of the network must be retrieved continuously
         to assess the relevance of a reconfiguration, quantifying:
         <list>
            <t> the cost of the sub-optimality: resources may not be used
               optimally (e.g., a better path exists);
            </t>
            <t>
               the reconfiguration cost: the controller needs to trigger some
               reconfigurations.
               During this transient period, resources may be reserved twice, and control packets have to be transmitted.
            </t>
         </list>
         Thus, a reconfiguration should only be triggered if the expected gain significantly exceeds its cost.
         </t>
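      <t>
         The trade-off between the cost of sub-optimality and the reconfiguration cost can be
         sketched as a simple decision rule in Python. The function name, its parameters, the
         linear cost model, and the default safety margin are hypothetical; an actual controller
         would use its own cost model.
      </t>
      <figure><artwork><![CDATA[
```python
def should_reconfigure(suboptimality_cost, reconfiguration_cost,
                       horizon, margin=1.5):
    """Decide whether a reconfiguration is worthwhile.

    suboptimality_cost: resources wasted per time unit by keeping
        the current schedule.
    reconfiguration_cost: one-off cost of the change (duplicate
        reservations, control packets).
    horizon: number of time units the new schedule should last.
    margin: hypothetical safety factor; the expected gain must
        exceed the cost by this factor.
    """
    expected_gain = suboptimality_cost * horizon
    return expected_gain > margin * reconfiguration_cost
```
]]></artwork></figure>
      <t>
         The margin keeps the network from oscillating between schedules whose benefits are
         nearly equal.
      </t>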
      
     
 <!-- TODO
     <t>
         TODO: OAM is the glue between the controller (e.g., SDN controller) and the wireless infrastructure.
         In other words, we need to maintain a very precise and configurable view of the network performances.
         
         </t>
      -->

      <section title="Multipath Routing">
           <t>
               To be fault-tolerant, several paths can be reserved between two maintenance
               endpoints. 
               They must be node-disjoint so that the failure of a single node cannot break all of them, and a path remains available at any time.
               </t>
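           <t>
               As an illustration, the node-disjointness property can be checked as in the following
               Python sketch. The function name and the representation of a path as a list of node
               identifiers, from source to destination, are assumptions made for this example.
           </t>
           <figure><artwork><![CDATA[
```python
def are_node_disjoint(paths):
    """True if no intermediate node is shared between two paths.

    Each path is a list of node identifiers from source to
    destination. The endpoints (first and last element) are
    ignored, since they are shared by construction.
    """
    seen = set()
    for path in paths:
        interior = set(path[1:-1])
        if interior & seen:
            return False
        seen |= interior
    return True
```
]]></artwork></figure>
           <t>
               For instance, the paths S-A-C-E-R and S-B-D-F-R pass this check, whereas S-A-C-E-R
               and S-B-C-F-R do not, since both cross node C.
           </t>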
           
           </section>
           
      <section title="Replication / Elimination">
          <t>
              When multiple paths are reserved between two maintenance endpoints,
              these endpoints may decide to replicate the packets to introduce redundancy, and
              thus to mitigate transmission errors and collisions. 
              For instance, in <xref target="fig_replication"/>, the source node S 
              transmits the packet to both parents, nodes A and B.
              Each maintenance endpoint will decide to trigger the replication/elimination
              process when a set of metrics crosses a threshold value.
          </t>
        
          <figure anchor="fig_replication" align="center" title="Packet Replication: S transmits the same data packet twice, to its DP (A) and to its AP (B).">
            <artwork align="center"><![CDATA[
 
               ===> (A) => (C) => (E) === 
             //        \\//   \\//       \\
   source (S)          //\\   //\\         (R) (root)
             \\       //  \\ //  \\      //
               ===> (B) => (D) => (F) ===

            ]]></artwork>
          </figure>
          
	  </section>
         
      <section title="Resource Reservation">
          <t>
            Because the QoS criteria associated with a path may degrade, the network has
            to provision additional resources along the path. We need to provide
            mechanisms to patch a schedule (changing the channel offset, allocating more
            timeslots, changing the path, etc.).
            </t>
          
        </section>
      
      <section title="Soft transition after reconfiguration">
          <t>
              Since RAW expects to support real-time flows, we have to support soft reconfiguration,
              where the new resources are reserved before the old ones are released.
              Some mechanisms have to be proposed so that packets are forwarded through the
              new track only when the resources are ready to be used, while keeping
              the global state consistent (no packet reordering, duplication, etc.).
              </t>
          
        
              
        </section>
      </section>

   

      
      
  <section anchor="iana-consider" title="IANA Considerations">
  <t>This document has no actionable requirements for IANA. This section can be removed before the publication.</t>
  </section>
   
     <section anchor="security" title="Security Considerations">
     <t>
This section will be expanded in future versions of the draft.
     </t>

     </section>
     
           <section title="Acknowledgments">
         <t>
        TBD
         </t>
      </section>
      
</middle>


<back>
   
   <!--
    <references title="Normative References">
    </references>
    -->
   
   <references title="Informative References">
		<?rfc include="reference.RFC.6291"?> <!-- OAM definition -->
        <?rfc include="reference.RFC.7276"?> <!-- An Overview of Operations, Administration, and Maintenance (OAM) Tools -->
        <?rfc include="reference.RFC.8174"?> <!-- Ambiguity of Uppercase vs Lowercase -->
        <?rfc include="reference.RFC.8655"?> <!-- Deterministic Networking Architecture -->
        <?rfc include="reference.RFC.2119"?> <!-- Key words for use in RFCs to Indicate Requirement Levels  -->

      
      <reference anchor="ipath" target="https://doi.org/10.1109/TNET.2014.2371459">
         <front>
            <title>iPath: path inference in wireless sensor networks.</title>
            <author initials="Y." surname="Gao" fullname="Y.Gao"></author>
            <author initials="W." surname="Dong" fullname="W. Dong"></author>
            <author initials="C." surname="Chen" fullname="C. Chen"></author>
            <author initials="J." surname="Bu" fullname="J. Bu"></author>
            <author initials="W." surname="Wu" fullname="W. Wu"></author>
            <author initials="X." surname="Liu" fullname="X. Liu"></author>
            <date year="2016"/>
            <!--
             IEEE/ACM Transactions on Networking (TON) archive
             Volume 24 Issue 1, February 2016
             Pages 517-528
             -->
         </front>
      </reference>
      
      
   </references>
   
   

 <!-- TODO
 
 
   <section title="TO BE ADDRESSED">
       <t>
           3 principles for high reliability:
           <list style="symbols">
           
               <t>
                   elimination of single points of failure
               </t>
               
               <t> reliable crossover: This is not treated explicitly but is tacit within 5.2. Internet
               Protocol executes the principle. The problem with this draft is it
               advocates adding state into the routing fabric without seeming to
               realize that that tends to defeat the reliable crossover principle. 5.3
               resource reservation clearly does this. Therefore, I recommend that the
               reliable crossover principle get explicit treatment at this stage so
               the tradeoffs can be properly weighed.
                 (We've gone through resource reservation protocols before - remember
               RSVP? - and they've not been very successful).

                   
                   </t>
           <t>detection of faults as they occur: 3.3 and 3.4 treat this, but, IMHO, the organization should again be
           cleaned up.  Back in the days when Simple Network Management Protocol
           was being worked up, a common acronym was FCAPS for [Fault Configuraion
           Accounting Performance Security]. Sounds to me like a pattern that this
           discussion could be mapped onto.
              In this case we have performance management solutions commingled
           with fault management ones. My experience is that QoS issues should
           always be the _last_ to be discussed, not the first.
              SNMP is nowhere mentioned in the draft. Sooner or later, you're
           going to need gets, sets and traps.  Is now the right time?

               
               </t>
           </list>
       </t>
       
       
       </section>
-->

</back>

</rfc>
