idnits 2.17.1 draft-jiang-nmlrg-traffic-machine-learning-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (June 3, 2016) is 2884 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Obsolete informational reference (is this intentional?): RFC 2818 (Obsoleted by RFC 9110) -- Obsolete informational reference (is this intentional?): RFC 5246 (Obsoleted by RFC 8446) -- Obsolete informational reference (is this intentional?): RFC 7749 (Obsoleted by RFC 7991) Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Machine Learning Research Group S. Jiang, Ed. 3 Internet-Draft B. Liu 4 Intended status: Informational Huawei Technologies Co., Ltd 5 Expires: December 5, 2016 P. Demestichas 6 University of Piraeus 7 J. Francois 8 Inria 9 G. M. Moura 10 SIDN Labs 11 P. 
Barlet 12 Network Polygraph 13 June 3, 2016 15 Use Cases of Applying Machine Learning Mechanism with Network Traffic 16 draft-jiang-nmlrg-traffic-machine-learning-00 18 Abstract 20 This document introduces a set of use cases in which machine learning 21 technologies are applied to network traffic relevant activities, 22 including machine learning based traffic classification, traffic 23 management, etc. 25 Status of This Memo 27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. The list of current Internet- 33 Drafts is at http://datatracker.ietf.org/drafts/current/. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 This Internet-Draft will expire on December 5, 2016. 42 Copyright Notice 44 Copyright (c) 2016 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents 49 (http://trustee.ietf.org/license-info) in effect on the date of 50 publication of this document. Please review these documents 51 carefully, as they describe your rights and restrictions with respect 52 to this document. Code Components extracted from this document must 53 include Simplified BSD License text as described in Section 4.e of 54 the Trust Legal Provisions and are provided without warranty as 55 described in the Simplified BSD License. 57 Table of Contents 59 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 60 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 61 3. 
Methodology of Learning from Traffic . . . . . . . . 4 62 3.1. Data of the Network Traffic . . . . . . . . . . . . . . . 4 63 3.2. Data Source and Storage . . . . . . . . . . . . . . . . . 5 64 3.3. Architecture Considerations . . . . . . . . . . . . . . . 5 65 3.4. Closed Control Loop . . . . . . . . . . . . . . . . . . . 6 66 4. Use Cases Study of Applying Machine Learning in Network . . . 6 67 4.1. HTTPS Traffic Classification . . . . . . . . . . . . . . 6 68 4.2. Malicious Domains: Automatic Detection with DNS Traffic 69 Analysis . . . . . . . . . . . . . . . . . . . . . . . . 9 70 4.3. Machine-learning based Policy Derivation and Evaluation 71 in Broadband Networks . . . . . . . . . . . . . . . . . . 10 72 4.4. Traffic Anomaly Detection in the Router . . . . . . . . . 11 73 4.5. Applications of Machine Learning to Flow Monitoring . . . 12 74 5. Security Considerations . . . . . . . . . . . . . . . . . . . 15 75 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 15 76 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 15 77 8. Change log [RFC Editor: Please remove] . . . . . . . . . . . 16 78 9. Informative References . . . . . . . . . . . . . . . . . . . 16 79 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 17 81 1. Introduction 83 Machine learning technology has been successful in solving 84 complicated issues. It helps to make predictions or decisions based 85 on large datasets. It can also dynamically adapt to varying 86 situations and respond to real-time issues. Therefore, more and 87 more research has started to apply machine learning in the network 88 area. 90 Among many aspects of networks, the network traffic is one of the 91 most complicated managed objects. Its volume is rapidly growing 92 along with the Internet explosion. It is always dynamically 93 changing. Most network traffic flows only last a few minutes, or 94 even less.
Moreover, the user content within traffic is becoming more 95 diverse due to the development of various network services, and the 96 increasing use of encryption. Consequently, it is more and more 97 challenging for administrators to stay aware of the network's running 98 status and efficiently manage the network traffic flows. Although 99 more and more data regarding network traffic is generated, 100 traditional mechanisms based on pre-designed network traffic patterns 101 become less and less efficient. 103 It is natural to utilize powerful machine learning technology to 104 analyze the large amount of data regarding network traffic, to 105 understand the network's status, such as performance, failures, 106 security, etc. It is a big advantage that machines can measure and 107 analyse the network traffic, then report the results and predictions 108 to humans for further decision. The machines can handle vast 109 amounts of data, which is almost impossible for humans to deal with, 110 in close to real time. Moreover, if the speed and accuracy of the 111 prediction are high enough, it is possible that the subsequent action 112 based on the prediction result could form a closed control loop to 113 achieve autonomic management. However, the maturity of the latter might 114 be far in the future. Today, traditional control programs still 115 look more reliable than machine learning based control mechanisms. 117 This document first analyzes the data of the network traffic from 118 various perspectives, and also discusses several important practical 119 considerations, including the training data source, data storage and 120 the learning system architecture. It then introduces a set of use 121 cases that have been shown to work well, although there is large 122 scope for improvement, including ML-based traffic classification, 123 traffic management, interface failure prediction, etc. 125 Editor notice: this document is at a preliminary stage.
It collects 126 the use cases presented at the proposed Network Machine Learning 127 Research Group (NMLRG) session at the IETF 95 meeting. 129 2. Terminology 131 This document defines the following terminology. 133 Machine Learning A computational mechanism that analyzes and learns 134 from data input, either historic data or real-time feedback data, 135 following a set of designed features and algorithms. It can be 136 used to make analyses, predictions or decisions, rather than 137 following strictly static program instructions. 139 Network Traffic The amount of data moving across a network at a 140 given point in time. It is mostly encapsulated in network 141 packets. 143 Traffic Flow A sequence of packets from a source computer to a 144 destination [RFC6437]. It is the unit of network traffic. 146 Feature (machine learning) In machine learning and pattern 147 recognition, a feature is an individual measurable property of a 148 phenomenon being observed. Choosing informative, discriminating 149 and independent features is a crucial step for effective 150 algorithms in pattern recognition, classification and regression. 152 Algorithm (machine learning) Machine learning algorithms operate by 153 building a model from example inputs in order to make data-driven 154 predictions or decisions expressed as outputs, rather than 155 following strictly static program instructions. An incomplete list 156 of machine learning algorithms includes supervised learning, 157 unsupervised learning, semi-supervised learning, reinforcement 158 learning, deep learning, etc. 160 3. Methodology of Learning from Traffic 162 3.1. Data of the Network Traffic 164 There is plenty of valuable data related to the network traffic. 165 These data are the raw features in the learning process. The following is a 166 simple classification of network traffic data. 168 Measurable properties There are many measurable properties of 169 network traffic, such as latency, number of packets, duration, 170 etc.
These properties are also essential features, 171 especially for use cases relevant to performance, QoS (Quality of 172 Service), etc. 174 Data within communication protocols The user contents are 175 encapsulated in layered communication protocols. Much information 176 is contained within the protocol headers, for example the source 177 and destination IP addresses in the IP header, the port numbers in 178 the TCP/UDP header, etc. Transport layer protocols are often 179 related to the type of applications, such as FTP (File Transfer 180 Protocol) for file transfer, HTTP (Hyper Text Transfer Protocol) 181 for web, etc.; and many application-relevant data are embedded 182 within these protocols. These could also be essential data for 183 classification or application-oriented analysis. However, some 184 traffic will not provide transport or application information, due 185 to unknown protocols or encryption. 187 User content User content is the payload of packets, which might 188 be obtained by DPI (Deep Packet Inspection) within the transit 189 network if the packets are unencrypted, or it could be analyzed 190 by the source or destination nodes. 192 Data in network signaling protocols Traffic flows are managed or 193 indirectly influenced by various network signaling protocols. For 194 example, the routing protocols determine the next hop of a 195 specific network traffic flow, or even the traffic path (with some 196 sophisticated routing protocol such as MPLS-TE (Multi-Protocol 197 Label Switching - Traffic Engineering), segment routing, etc.); 198 the P2P (Peer to Peer) protocol can even decide the destination of 199 a specific content flow. These are relevant and are potential 200 features for traffic analysis. Furthermore, the traffic of these 201 signaling protocols themselves may also be a learning objective. 203 3.2.
Data Source and Storage 205 Within networks, forwarding devices such as routers, switches, 206 firewalls, etc., are the entities that directly handle the network 207 traffic. Thus, they can collect network traffic data, such as 208 measurable properties, protocol information, etc. Source nodes or 209 destination nodes, particularly servers, could also be sources of 210 network traffic data. They can either report the collected data to 211 a central repository for storage and learning, or collect and store 212 the data themselves for local learning. This depends on the 213 learning architecture, which is discussed in the following section. 215 3.3. Architecture Considerations 217 Global learning vs. local learning 219 * Global learning refers to tasks that are mostly network- 220 level, so that they need to be done from a global viewpoint. In 221 this case, the learning entity is normally centralized and is 222 different from the data source entities. 224 * Local learning is more applicable to tasks that are only 225 relevant to one or a limited group of devices, and they can 226 be done directly within that one node or that limited group of 227 nodes. In the case of grouped nodes, the data may also need 228 to be transferred from the data source entity to the learning entity. 230 Offline & online learning 232 * Co-located mode: training (offline, based on historic data) and 233 prediction (online, based on real-time data) are both done 234 within the same entity. The entity could be a central 235 repository or a specific node. 237 * De-coupled mode: training is done in the central repository, 238 and prediction is made by the routers/switches/firewalls or 239 other devices that directly process the network traffic. 241 Central learning & distributed learning Central learning means the 242 learning process is done at a single entity, which is either a 243 central repository or a node.
Distributed learning refers to 244 ensemble learning, in which multiple entities do the learning 245 simultaneously and combine the results to produce a 246 final result. Since network devices are naturally distributed, 247 it can be foreseen that ensemble learning is a good approach for 248 a certain set of use cases. 250 3.4. Closed Control Loop 252 The prediction made by a machine learning mechanism could be directly 253 used to manipulate the network traffic, or to drive other relevant actions, 254 such as changing the device configuration, etc. 256 However, as the introduction section said, this kind of utilization 257 might be suitable only for a small set of the use cases, due to the 258 limited accuracy of machine learning technologies. Besides, some 259 critical usages simply cannot tolerate any false decision. 261 4. Use Cases Study of Applying Machine Learning in Network 263 Editor notes: This section is a collection of the work presented at 264 the proposed NMLRG session at the IETF 95 meeting. More contributions on 265 use cases are welcome. 267 4.1. HTTPS Traffic Classification 269 Managing network traffic requires a good understanding of the content 270 of traffic flows for various purposes. Indeed, enhancing the QoS by 271 prioritizing or scheduling the flows, or enforcing security policies 272 by filtering some of them, cannot rely solely on protocol headers such as the 273 IP, TCP or UDP headers. Analyzing the user content with DPI is therefore 274 necessary. However, this poses serious concerns regarding 275 user privacy. In addition, OTT (Over-the-Top) actors would prefer to 276 fully control their network traffic rather than being subject to any 277 intermediaries' policies. As a result, encrypting the traffic has 278 been widely adopted in recent years. 280 In that context, traffic management is facing severe difficulties 281 since DPI is no longer effective.
Using an intermediary service or 282 proxy is the only way to analyze the content of encrypted traffic, 283 but it requires a high level of trust in the intermediaries, which is not 284 always guaranteed, for example for the end-users of an operator's 285 network. 287 Therefore, new techniques with the ability to extract knowledge and 288 insight from encrypted flows are necessary. In particular, HTTPS 290 [RFC2818] is now a major protocol used over the Internet, because it 291 provides secure Web communication while the Web is now embracing various 292 services that were provided separately in the past: email, video 293 streaming, chat, VoIP, file sharing, etc. It relies on TLS 294 (Transport Layer Security) [RFC5246], [RFC6066] to encapsulate HTTP 295 requests. 297 Being able to identify the service and the provider of an HTTPS 298 connection would help in applying different strategies for managing 299 the corresponding flow. For instance, VoIP (Voice over IP) and email 300 do not require the same QoS, or some service use might be prohibited, 301 like file sharing, to avoid data leakage in a company. 303 As a concrete example, Google, Facebook and Amazon are service 304 providers, while maps, drive and gmail are services of Google. To 305 identify them when they are accessed by a user, identification based on IP addresses and DNS 306 (Domain Name System) names is not reliable, as 307 users can rely on intermediaries to serve as proxies or to 308 resolve DNS requests, respectively. The SNI (Server Name Indication) [RFC6066] 309 is a TLS extension indicated by the client when 310 initiating the TLS handshake (Client Hello). SNI actually contains 311 the hostname to which the request is addressed. Such a hostname is 312 indicative of the service and the service provider name. However, SNI 313 is an optional field and can be easily forged to circumvent HTTPS 314 filtering without impacting service use [bypasssni].
More advanced 315 mechanisms are hence necessary to improve the robustness of 316 identification, even in the case of non-collaborative users. 318 The objective is to automatically label an HTTPS connection 319 with the service and service provider associated with it. The TLS 320 handshake is not encrypted, but the data exchanged during this phase 321 (random number, selected ciphers,...) is not distinctive of the 322 accessed service. However, the nature of the accessed service directly 323 impacts the user content transmitted through the secure channel, 324 especially the type, the size and the way those data are transmitted. Such 325 metadata are still measurable properties. 327 HTTPS Connection 328 + 329 |(1) 330 +-------v------+ 331 |TLS Connection| 332 |Reconstruction| 333 +-------+------+ 334 |(2) 335 +-------v------+ (3') (4') 336 | Features +-------------+----------------------------+ 337 | Extraction | | | 338 +-------+------+ +-------v---------+ +----v----+ 339 | |Service Provider +------------->Services | 340 |(3) |L1 model | Load |L2 model | 341 | +-------^---------+ services +----^----+ 342 +-------v------+ | model X | 343 |SNI Labelling | +----------------------------+ 344 +-------+------+ |(5) 345 | +-----------------------------------------+ 346 +------------> Training and | 347 (4) | Models building | 348 +-----------------------------------------+ 350 Two-levels HTTPS traffic classification 352 In the figure above, step (1) consists of reconstructing the HTTPS 353 connection and retrieving the packets, on top of which the following 354 metrics are observed (2): 356 o Inter Arrival Time 358 o Packet size 360 o Encrypted data size: this feature has the advantage of being strongly 361 related to the accessed service, unlike the packet size, which 362 is biased by lower-layer headers 364 Based on these values, aggregated features are computed: average, 365 minimum, maximum, 25th percentile, median, 75th percentile.
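The aggregation step above can be sketched as follows. This helper is illustrative, not code from the framework; the three per-packet metrics and the six statistics mirror the lists above, while the function and key names are invented.

```python
import statistics

def aggregate(values):
    """Summarize one per-packet metric (inter-arrival time, packet size,
    or encrypted data size) into the six aggregated features listed above."""
    # statistics.quantiles with n=4 returns the 25th/50th/75th cut points.
    q = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "p25": q[0],
        "median": q[1],
        "p75": q[2],
    }

def connection_features(inter_arrival, pkt_sizes, enc_sizes):
    """Flatten the aggregates of all three metrics into one feature vector."""
    feats = {}
    for name, series in (("iat", inter_arrival),
                         ("pkt", pkt_sizes),
                         ("enc", enc_sizes)):
        for key, value in aggregate(series).items():
            feats[f"{name}_{key}"] = value
    return feats
```

The resulting 18-entry vector (3 metrics x 6 statistics) is what a per-connection classifier would consume.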
367 Because different providers may offer a similar service, a single 368 classifier could fail to distinguish them. A multi-level machine 369 learning approach has been proposed. For learning, a dataset without 370 forged SNI is used (3) to build the classifiers (4). The result is 371 (5): 373 o a first-level model (L1 model), whose goal is to identify the 374 service provider, 376 o a set of second-level models (L2 models), one for each service 377 provider, to identify the specific service of that service provider 379 Once all classifiers are trained, a new unknown HTTPS connection is 380 first matched against the L1 model (3'). The output is the 381 predicted service provider, but it also leads to loading the corresponding 382 L2 model (4') to determine the specific service of this service 383 provider. 385 This framework is independent of the ML technique being used. Each 386 model could also be built with a different technique, but our study 387 has shown that the best results are obtained with Random Forest. 389 The HTTPS classification framework has been tested over 288,901 390 connections from lab users. Standard evaluation procedures have been 391 applied. Less representative features have been automatically 392 discarded. Using a ten-fold cross-validation, each tested connection 393 has been marked as a perfect identification (both the service provider 394 and the service name are rightly identified), a partial identification 395 (only the service provider is identified) or invalid (neither of them). 396 93.1% fall in the first category, 2.9% in the second and the rest in 397 the third. Full results are available in [httpsframework]. 399 Although the results are promising, the current method can only be 400 applied once the HTTPS connection has ended, i.e. after it has been 401 reconstructed. This prevents applying any kind of policy to the 402 corresponding traffic flow. A future challenge is thus to classify the 403 connection before it ends, so that policies can be applied in time. 405 4.2.
Malicious Domains: Automatic Detection with DNS Traffic Analysis 407 Since their inception, domain names have been used to provide a 408 simple identification label for hosts, services, applications, and 409 networks on the Internet [RFC1034]. In the same way, domains and the 410 DNS infrastructure have also been misused in various types of abuses, 411 such as phishing, spam, and malware distribution, among others. 413 Newly registered malicious domain names are well known to exhibit a very 414 distinct initial DNS lookup pattern compared to legitimate ones: typically, 415 they exhibit an abnormally high number of lookups [Hao2011]. One 416 of the reasons is that malicious domains tend to rely upon spam 417 campaigns within the first hours after the registration of these 418 domains, in order to maximize the number of victims before the domain 419 is detected and taken down. 421 In order to protect users from such domains, nDEWS (New Domains Early 422 Warning System) [Moura2016], a tool that classifies newly 423 registered domains based on their initial lookup pattern, has been 424 proposed. This requires access to (i) a 425 domain registration database and (ii) authoritative DNS server 426 traffic data, which is typically the case for Top-Level Domain (TLD) 427 registries. These domains are clustered into two clusters 428 using k-means, based on four features extracted 429 from the analyzed DNS traffic: # DNS queries, # IP addresses, # 430 Autonomous Systems (ASes), and # Countries, which were chosen 431 empirically. 433 As a result, in an automated fashion, a large variety of suspicious 434 domains can be detected, including phishing and malware, but also other 435 types, such as fake pharmaceutical shops as well as counterfeit- 436 sneaker shops. In this pilot study, the responsible registrars are 437 notified about these websites.
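The two-cluster step above can be sketched with a minimal, hand-rolled k-means (k=2); the four features per domain mirror the list above, but the sample values and the centroid seeding are invented for illustration and are not part of nDEWS.

```python
import math

def kmeans2(points, iters=20):
    """Plain k-means with k=2. Each point is a feature vector of
    (# DNS queries, # IP addresses, # ASes, # countries) for one
    newly registered domain."""
    # Illustrative seeding: the lexicographically smallest and largest points.
    cents = [min(points), max(points)]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            dists = [math.dist(p, c) for c in cents]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        cents = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
                 for cl, c in zip(clusters, cents)]
    return cents, clusters

domains = [
    (12, 5, 3, 2),       # few lookups: typical benign registration
    (15, 6, 4, 2),
    (900, 400, 80, 40),  # lookup burst from many ASes: suspicious pattern
    (950, 420, 85, 42),
]
cents, clusters = kmeans2(domains)
```

Domains falling into the high-lookup cluster would be the candidates flagged for manual review or registrar notification.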
Ultimately, it 438 allows these websites to be taken down, minimizing the potential 439 number of victims. 441 4.3. Machine-learning based Policy Derivation and Evaluation in 442 Broadband Networks 444 Service provisioning is becoming more complex. For instance, there 445 are services with diverse quality requirements, there is variance 446 of the requirements in time and space, and there is the need for 447 utmost resource efficiency. Moreover, full agility in time and space 448 (in order to accomplish resource-efficient service provisioning) 449 requires the solution of computationally intensive tasks. In this 450 respect, policies can play a role: specifying the network behaviour in 451 time periods and service-area regions. 453 In this direction, machine learning can have a fundamental role, 454 e.g., for learning the situations encountered and "good" ways (policies) 455 of handling them. This contribution addresses the role that machine 456 learning can play in policy derivation and evaluation. In more 457 detail, it addresses the requirements on the role of machine learning, 458 including potential inputs and outputs. 460 Knowledge and machine learning can be an important aspect of wireless 461 networks. Knowledge is created both regarding the contexts and their 462 occurrence, as well as on the association of each context with 463 specific actions and their scoring. The latter encompasses the development 464 of knowledge on how to handle acquired contexts; this knowledge will 465 include the contexts encountered, the corresponding handlings done 466 (decisions applied), the potential alternative handlings, and the 467 respective efficiency of each handling (actually applied or 468 alternate). 470 Reinforcing "good" solutions for each encountered context (e.g. 471 reinforcement learning) can be a vital and unique element of a 472 knowledge-based management system.
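One way to picture the "reinforce good handlings per context" idea above is a simple action-value table that keeps a running average efficiency score per (context, action) pair; this is a minimal sketch, and the context and action names are invented for illustration.

```python
from collections import defaultdict

class PolicyScores:
    """Track the average observed efficiency of each (context, action)
    pair, so the best-scoring handling can be reused the next time the
    same context is encountered."""

    def __init__(self):
        self.score = defaultdict(float)  # (context, action) -> mean efficiency
        self.count = defaultdict(int)

    def record(self, context, action, efficiency):
        key = (context, action)
        self.count[key] += 1
        # Incremental mean: new_avg = avg + (x - avg) / n
        self.score[key] += (efficiency - self.score[key]) / self.count[key]

    def best(self, context, actions):
        """Return the candidate handling with the highest score so far."""
        return max(actions, key=lambda a: self.score[(context, a)])

ps = PolicyScores()
# Hypothetical context (cell congestion) with two candidate policies.
ps.record("cell_congested", "reroute", 0.9)
ps.record("cell_congested", "throttle", 0.6)
ps.record("cell_congested", "reroute", 0.7)
```

A full reinforcement learning agent would also handle exploration and delayed rewards; the table above only captures the scoring of applied and alternative handlings described in the text.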
Machine learning can be realized 473 through clustering to discover underlying structures in data, 474 regression to identify patterns and predict values in cell and 475 network usage, classification to classify first-seen unknown users, 476 and density estimation to model complex user behavior and network 477 usage. Several deep architectures and techniques (such as pre- 478 training) can be utilized, in order to generalize better on complex 479 data with underlying information and to be able to make accurate 480 predictions, even on unseen data. 482 As a result, depending on what we want to achieve, the proper machine 483 learning approach can be used. 485 Through machine learning it will be possible to provide faster and 486 targeted solutions to specific network problems. Moreover, it is 487 possible to cluster various usage profiles and prioritize the traffic 488 according to the criticality level. For instance, mission-critical 489 services need special attention with respect to latency and 490 prioritization, compared to plain services which may tolerate a bit 491 of delay without jeopardizing the overall quality. In addition, 492 machine learning can lead to improved results in KPIs (Key 493 Performance Indicators) such as end-user throughput, latency, energy 494 consumption and overall cost effectiveness. Moreover, reliability 495 can be increased since certain problematic situations may be 496 predicted before they happen; hence it will be possible to act pro- 497 actively and alleviate the negative impact of a problem in the 498 network. 500 It is evident that machine learning can have significant importance 501 in policy derivation and evaluation in broadband networks, especially 502 in 5G infrastructures, which will be complex and heterogeneous 503 and will need to accommodate multiple services ranging from mobile broadband 504 to massive machine-type, mission-critical and vehicular 505 communications. 507 4.4.
Traffic Anomaly Detection in the Router 509 Modern routers usually have the capability to raise alarms when the 510 bandwidth usage rate of a specific interface is high. When network traffic 511 exceeds a certain threshold, the router will consider it an 512 anomaly event and report it to the NMS (Network Management System). 513 For instance, in some routers/switches, there exists configuration 514 such as "trap-threshold { input-rate | output-rate }" to trigger 515 traffic alarms, which is statically configured by experienced 516 administrators. However, network traffic is usually not static and 517 may even change significantly due to changes in the carried services, 518 the residential situation, etc. Thus, a static configuration cannot 519 effectively identify traffic anomaly events. 521 To address the above issue, machine learning technologies are applied in 522 routers/switches to learn the local traffic pattern and detect 523 traffic anomaly events based on the learning results. 525 Wavelets are employed to analyze time-series network traffic for 526 anomaly detection. At a certain interval, the routers measure, 527 record, and analyze the input and output traffic rates respectively, 528 or in the form of rate sums. (The former is recommended for a finer- 529 granularity analysis.) 531 After running for some time, the router would get a set of "time-rate" 532 data, collected as time-series waves for further wavelet analysis. 533 Besides wavelets, this use case proposes other machine learning 534 techniques such as outlier detection. In that case, features are to 535 be extracted from the wavelets for supervised or unsupervised learning. 537 After data collection, the router would sort the data and derive 538 the alarm threshold statistically, based on the data distribution, to 539 discriminate between normal and outlier traffic rates. When interface 540 traffic exceeds the threshold, the router would raise alarms to the 541 NMS.
The router could dynamically adjust the alarm threshold with 542 newly arriving data, by periodic anomaly analysis. This approach helps 543 devices detect traffic anomalies more efficiently and effectively, 544 compared to the traditional way of learning at a central repository 545 that collects traffic information from various devices. 547 This use case could be extended from a single interface to multiple 548 ones, that is, a device scope covering multiple traffic waves, and even a wider 549 scope of multiple devices in a certain domain. This would make the 550 analysis more comprehensive. 552 Besides wavelet analysis, there might be more techniques to explore, 553 such as correlation analysis of traffic anomaly events among multiple 554 devices. 556 4.5. Applications of Machine Learning to Flow Monitoring 558 A commercial cloud-based flow monitoring service from Network 559 Polygraph [polygraph] has used Machine Learning analysis as a cost- 560 effective alternative to DPI for traffic classification, which 561 identifies the application responsible for each network traffic flow. 563 Nowadays, DPI is considered the standard technology for traffic 564 classification. However, DPI is generally expensive, as it requires 565 the analysis of the payload of every single packet. This usually 566 involves the use of powerful, specialized hardware appliances, which 567 need to be deployed on every link to obtain full coverage of the 568 network. In the case of Network Polygraph, the use of DPI is 569 impractical, because the volume of data to be exported to the cloud 570 would be overwhelming (i.e., all traffic would have to be replicated). A 571 more viable alternative is the use of flow-based monitoring 572 technologies, such as NetFlow [RFC3954] or IPFIX [RFC7011], where the 573 volume of exported data is significantly lower.
Flow-based 574 monitoring technologies provide summarized information (e.g., 575 duration, traffic volume) for every connection (or "traffic flow") 576 handled by a router. The information available in flow records is 577 more limited compared to DPI (e.g., packet payloads are not 578 available). As a result, most flow-based monitoring tools base their 579 classification on port numbers or simple heuristics, which are 580 known to be highly unreliable. 582 To address this problem, Network Polygraph uses a traffic 583 classification approach based on ML. Several studies showed that 584 supervised learning can achieve similar classification accuracy to 585 DPI at a fraction of its cost. However, supervised methods suffer 586 from some practical limitations that make them very difficult to 587 deploy and maintain in production environments. For example, they 588 require a costly training phase prior to their deployment and need to 589 be retrained frequently, every time there is a change in the network 590 or in the network applications. 592 This section describes the ML approach used by Network Polygraph for 593 online classification of NetFlow/IPFIX traffic. To overcome the 594 practical limitations of supervised learning, Network Polygraph 595 incorporates an automatic retraining system. Figure 1 shows the 596 components and data flow of the classification engine, which is 597 divided into two parts: 599 o The classification path (Figure 1, top) is in charge of 600 classifying the traffic online using ML. The inputs of the 601 classification path are the NetFlow/IPFIX flows exported by the 602 routers, while the outputs are the classified flows. Several 603 traffic features are extracted from each flow, including the 604 information directly available in the flow records (e.g., 605 addresses, ports, packet and byte counts) together with some 606 constructed features (e.g., average packet size, rate and 607 interarrival time).
The traffic features are the input of the 608 traffic classification algorithm, whose function is to identify 609 the application that generated the flow. Among the different 610 supervised algorithms, a C5.0 decision tree was selected because 611 it has been shown to present the best accuracy/cost ratio for 612 traffic classification. Other supervised methods, e.g., Support 613 Vector Machines (SVM) and Artificial Neural Networks (ANN), obtain 614 similar accuracy, but classification and training times are faster 615 with decision trees. In Network Polygraph, training times are 616 critical because the training path continuously updates the 617 classification model in the background. 619 o The training path (Figure 1, bottom) implements the automatic 620 retraining system, which is responsible for automatically updating 621 the classification model when it becomes obsolete. To that end, a 622 random packet-level sample of the network traffic is continuously 623 collected using flow-based sampling. Sampled flows are then 624 labeled using DPI. It is possible to use DPI in the training path 625 because training can be performed with only a small data sample 626 (e.g., 1/1000 flows). This significantly reduces the 627 computational overhead and the volume of data to be exported. The 628 labeled sample is used to verify the accuracy of the 629 classification model. The system accuracy is estimated by 630 comparing the output of DPI (training path) and C5.0 631 (classification path) for those flows sampled in the training 632 path. If the estimated accuracy falls below a configurable 633 threshold, the labeled sample is used to generate an updated model 634 using only those features available in NetFlow/IPFIX (IP Flow 635 Information Export) records. This training process can also be 636 performed at a few vantage points, and the resulting model reused 637 for other networks where only NetFlow/IPFIX monitoring data is available.
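The interplay of the two paths can be sketched as follows. This is an illustrative sketch, not Network Polygraph code: a toy port-lookup model stands in for the C5.0 decision tree, `dpi_label` stands in for the DPI labeling engine, and the simplified flow-record fields are assumptions.

```python
import random

RETRAIN_THRESHOLD = 0.96  # configurable accuracy threshold

def features(flow):
    # Fields directly available in flow records plus constructed
    # features such as the average packet size; the real C5.0 tree
    # consumes a feature vector like this one.
    return (flow["dst_port"], flow["packets"],
            flow["bytes"] / max(flow["packets"], 1))

class PortModel:
    """Toy stand-in for the C5.0 decision tree: memorizes the
    destination-port -> application mapping seen at training time."""

    def __init__(self, labeled_flows):
        self.by_port = {f["dst_port"]: app for f, app in labeled_flows}

    def classify(self, flow):
        return self.by_port.get(flow["dst_port"], "unknown")

def process_batch(model, flows, dpi_label, sample_rate=0.001):
    """One round of the engine.  Classification path: label every flow
    with the current model.  Training path: DPI-label a small random
    flow sample, estimate accuracy by comparing both outputs, and
    retrain when the estimate falls below the threshold."""
    classified = [(f, model.classify(f)) for f in flows]

    sample = [f for f in flows if random.random() < sample_rate]
    labeled = [(f, dpi_label(f)) for f in sample]
    if labeled:
        agree = sum(model.classify(f) == app for f, app in labeled)
        if agree / len(labeled) < RETRAIN_THRESHOLD:
            model = PortModel(labeled)  # updated model from the sample
    return classified, model
```

The key design point reproduced here is that the expensive labeler (DPI) only ever sees the small sampled stream, while the cheap model classifies everything and is refreshed automatically, without human intervention, whenever its estimated accuracy degrades.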
639 CLASSIFICATION PATH 641 NetFlow/ +----------+ +----------+ Classified 642 IPFIX | Feature | | C5.0 | flows 643 +-------->|Extraction+------------------------>|Classifier+-----------> 644 | | | | 645 +----------+ +----------+ 646 ^ 647 | 648 TRAINING PATH +----------+ +----------+ | 649 | NetFlow/ | | Feature | | Retraining 650 +-->| IPFIX +-->|Extraction+--+ | 651 Packet stream | |Generation| | | | | 652 (flow sampling) | +----------+ +----------+ | | 653 +--------------->| +--+ DPI-labeled 654 | +----------+ | NetFlow/ 655 | | DPI | | IPFIX 656 +---------->| App. +---------+ 657 | Labeling | 658 +----------+ 660 Network Polygraph classification engine data flow 662 Figure 1 664 In order to validate the performance of the described ML approach, 665 the accuracy of Network Polygraph was measured using a complete 666 14-day trace from the 10-Gigabit link that connects the Catalan 667 Research and Education Network (Anella Cientifica) to its Spanish 668 counterpart (RedIRIS). The trace contained about 70 million flows 669 with a flow sampling rate of 1/400. The experimental results showed 670 that, with a 96% retraining threshold, the system sustained an 671 average classification accuracy of 97.5%, needing only 15 retrainings 672 during the 14 days, which were performed automatically without 673 requiring any human intervention. When the retraining threshold was 674 decreased to 94%, the accuracy was slightly reduced to 96.76% with 675 only 5 retrainings. 677 The target objective is to progressively reduce the dependence on DPI 678 technologies, which are expensive, difficult to deploy, not scalable, 679 and not robust against encryption, in favor of flow-based machine 680 learning approaches that are more cost-effective and can be easily 681 offered as a cloud service. 
In this direction, some research 682 challenges include the classification of web services and CDN traffic 683 from flow-based measurements, and the combination of multiple ground 684 truths obtained from vantage points in different networks. 686 5. Security Considerations 688 This document focuses on applying machine learning to networks at 689 the level of higher-layer concepts, including applying machine 690 learning to network security. Therefore, it does not itself create 691 any new security issues. 693 6. IANA Considerations 695 This memo includes no request to IANA. 697 7. Acknowledgements 699 The authors would like to acknowledge Josep Sanjuas, Andreas 700 Georgakopoulos, Kostas Tsagkaris, Valentin Carela, Wazen M. Shbair, 701 Thibault Cholez, and Isabelle Chrisment for their contributions. 703 The authors would also like to acknowledge the valuable comments made 704 by participants in the IRTF Network Machine Learning Research Group, 705 with particular thanks to Lars Eggert, Brian Carpenter, Albert 706 Cabellos, Shufan Ji, Susan Hares, Rudra Saha, and Dacheng Zhang. 708 Jerome Francois was partly funded by Flamingo, a Network of 709 Excellence project (ICT-318488) supported by the European Commission 710 under its 7th Framework Programme. 712 This document was produced using the xml2rfc tool [RFC7749]. 714 8. Change log [RFC Editor: Please remove] 716 draft-jiang-nmlrg-traffic-machine-learning-00: original version, 717 2016-06-03. 719 9. Informative References 721 [bypasssni] 722 Shbair, W., Cholez, T., Goichot, A., and I. Chrisment, 723 "Efficiently Bypassing SNI-based HTTPS Filtering", IFIP/ 724 IEEE International Symposium on Integrated Network 725 Management (IM2015) , 2015. 727 [Hao2011] Hao, S., Feamster, N., and R. Pandrangi, "Monitoring the 728 Initial DNS Behavior of Malicious Domains", Proceedings of 729 the 2011 ACM SIGCOMM Conference on Internet Measurement 730 Conference (IMC 2011) , Nov 2011.
732 [httpsframework] 733 Shbair, W., Cholez, T., Francois, J., and I. Chrisment, "A 734 Multi-Level Framework to Identify HTTPS Services", IEEE/ 735 IFIP Network Operations and Management Symposium , 2016. 737 [Moura2016] 738 M. Moura, G., Mueller, M., Wullink, M., and C. Hesselman, 739 "nDEWS: a New Domains Early Warning System for TLDs", 740 IEEE/IFIP International Workshop on Analytics for Network 741 and Service Management (AnNet 2016), co-located with IEEE/ 742 IFIP Network Operations and Management Symposium (NOMS 743 2016) , 04 2016. 745 [polygraph] 746 "Network Polygraph", . 748 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", 749 STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987, 750 . 752 [RFC2818] Rescorla, E., "HTTP Over TLS", RFC 2818, 753 DOI 10.17487/RFC2818, May 2000, 754 . 756 [RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export 757 Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004, 758 . 760 [RFC5246] Dierks, T. and E. Rescorla, "The Transport Layer Security 761 (TLS) Protocol Version 1.2", RFC 5246, 762 DOI 10.17487/RFC5246, August 2008, 763 . 765 [RFC6066] Eastlake 3rd, D., "Transport Layer Security (TLS) 766 Extensions: Extension Definitions", RFC 6066, 767 DOI 10.17487/RFC6066, January 2011, 768 . 770 [RFC6437] Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme, 771 "IPv6 Flow Label Specification", RFC 6437, 772 DOI 10.17487/RFC6437, November 2011, 773 . 775 [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, 776 "Specification of the IP Flow Information Export (IPFIX) 777 Protocol for the Exchange of Flow Information", STD 77, 778 RFC 7011, DOI 10.17487/RFC7011, September 2013, 779 . 781 [RFC7749] Reschke, J., "The "xml2rfc" Version 2 Vocabulary", 782 RFC 7749, DOI 10.17487/RFC7749, February 2016, 783 . 785 Authors' Addresses 787 Sheng Jiang (editor) 788 Huawei Technologies Co., Ltd 789 Q 22, Huawei Campus, No.156 Beiqing Road 790 Hai-Dian District, Beijing, 100095 791 P.R. 
China 793 Email: jiangsheng@huawei.com 795 Bing Liu 796 Huawei Technologies Co., Ltd 797 Q 22, Huawei Campus, No.156 Beiqing Road 798 Hai-Dian District, Beijing, 100095 799 P.R. China 801 Email: leo.liubing@huawei.com 802 Panagiotis Demestichas 803 University of Piraeus 804 Piraeus 805 Greece 807 Email: pdemestichas@gmail.com 809 Jerome Francois 810 Inria 811 615 rue du jardin botanique 812 54600 Villers-les-Nancy 813 France 815 Email: jerome.francois@inria.fr 817 Giovane C. M. Moura 818 SIDN Labs 819 Meander 501 820 Arnhem, 6825 MD 821 The Netherlands 823 Email: giovane.moura@sidn.nl 825 Pere Barlet 826 Network Polygraph 827 Edifici K2M - Parc UPC 828 Jordi Girona, 1-3, Barcelona 08034 829 Spain 831 Email: pbarlet@polygraph.io