idnits 2.17.1 draft-jiang-nmlrg-traffic-machine-learning-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (June 3, 2016) is 2884 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Obsolete informational reference (is this intentional?): RFC 2818 (Obsoleted by RFC 9110) -- Obsolete informational reference (is this intentional?): RFC 5246 (Obsoleted by RFC 8446) -- Obsolete informational reference (is this intentional?): RFC 7749 (Obsoleted by RFC 7991) Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Machine Learning Research Group S. Jiang, Ed. 3 Internet-Draft B. Liu 4 Intended status: Informational Huawei Technologies Co., Ltd 5 Expires: December 5, 2016 P. Demestichas 6 University of Piraeus 7 J. Francois 8 Inria 9 G. M. Moura 10 SIDN Labs 11 P. 
Barlet 12 Network Polygraph 13 June 3, 2016 15 Use Cases of Applying Machine Learning Mechanism with Network Traffic 16 draft-jiang-nmlrg-traffic-machine-learning-00 18 Abstract 20 This document introduces a set of use cases in which machine learning 21 technologies are applied to network traffic relevant activities, 22 including machine learning based traffic classification, traffic 23 management, etc. 25 Status of This Memo 27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. The list of current Internet- 33 Drafts is at http://datatracker.ietf.org/drafts/current/. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 This Internet-Draft will expire on December 5, 2016. 42 Copyright Notice 44 Copyright (c) 2016 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents 49 (http://trustee.ietf.org/license-info) in effect on the date of 50 publication of this document. Please review these documents 51 carefully, as they describe your rights and restrictions with respect 52 to this document. Code Components extracted from this document must 53 include Simplified BSD License text as described in Section 4.e of 54 the Trust Legal Provisions and are provided without warranty as 55 described in the Simplified BSD License. 57 Table of Contents 59 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 60 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 61 3. 
Methodology of Learning from Traffic . . . . . . . . 4 62 3.1. Data of the Network Traffic . . . . . . . . . . . . . . . 4 63 3.2. Data Source and Storage . . . . . . . . . . . . . . . . . 5 64 3.3. Architecture Considerations . . . . . . . . . . . . . . . 5 65 3.4. Closed Control Loop . . . . . . . . . . . . . . . . . . . 6 66 4. Use Cases Study of Applying Machine Learning in Network . . . 6 67 4.1. HTTPS Traffic Classification . . . . . . . . . . . . . . 6 68 4.2. Malicious Domains: Automatic Detection with DNS Traffic 69 Analysis . . . . . . . . . . . . . . . . . . . . . . . . 9 70 4.3. Machine-learning based Policy Derivation and Evaluation 71 in Broadband Networks . . . . . . . . . . . . . . . . . . 10 72 4.4. Traffic Anomaly Detection in the Router . . . . . . . . . 11 73 4.5. Applications of Machine Learning to Flow Monitoring . . . 12 74 5. Security Considerations . . . . . . . . . . . . . . . . . . . 15 75 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 15 76 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 15 77 8. Change log [RFC Editor: Please remove] . . . . . . . . . . . 16 78 9. Informative References . . . . . . . . . . . . . . . . . . . 16 79 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 17 81 1. Introduction 83 Machine learning technology has been successful in solving 84 complicated issues. It helps to make predictions or decisions based 85 on large datasets. It can also dynamically adapt to varying 86 situations and respond to real-time issues. Therefore, more and 87 more research has started to apply machine learning in the network 88 area. 90 Among many aspects of networks, the network traffic is one of the 91 most complicated managed objects. Its volume is rapidly growing 92 along with the Internet explosion. It is always dynamically 93 changing. Most network traffic flows only last a few minutes, or 94 even less.
Moreover, the user content within traffic is becoming more 95 diverse due to the development of various network services, and the 96 increasing use of encryption. Consequently, it is more and more 97 challenging for administrators to stay aware of the network's running 98 status and efficiently manage the network traffic flows. Although 99 more and more data regarding network traffic is generated, 100 traditional mechanisms based on pre-designed network traffic patterns 101 become less and less efficient. 103 It is natural to utilize powerful machine learning technology to 104 analyze the large amount of data regarding network traffic, to 105 understand the network's status, such as performance, failures, 106 security, etc. It is a big advantage that machines can measure and 107 analyse the network traffic, then report the results and predictions 108 to humans for further decision. The machines can handle vast 109 amounts of data, which is almost impossible for humans to deal with, 110 in close to real time. Moreover, if the speed and accuracy of the 111 prediction are high enough, it is possible that the subsequent action 112 based on the prediction result could form a closed control loop to 113 achieve autonomic management. However, the maturity of the latter might 114 be far in the future. Today, traditional control programs still 115 look more reliable than machine learning based control mechanisms. 117 This document first analyzes the data of the network traffic from 118 various perspectives, and also discusses several important practical 119 considerations, including the training data source, data storage and 120 the learning system architecture. It then introduces a set of use 121 cases that have been shown to work well, although there is large 122 scope for improvement, including ML-based traffic classification, 123 traffic management, interface failure prediction, etc. 125 Editor notice: this document is at a preliminary stage.
It collects 126 the use cases presented at the proposed Network Machine Learning 127 Research Group (NMLRG) session at the IETF 95 meeting. 129 2. Terminology 131 This document defines the following terminology. 133 Machine Learning A computational mechanism that analyzes and learns 134 from data input, either historic data or real-time feedback data, 135 following a set of designed features and algorithms. It can be 136 used to make analyses, predictions or decisions, rather than 137 following strictly static program instructions. 139 Network Traffic The amount of data moving across a network at a 140 given point in time. It is mostly encapsulated in network 141 packets. 143 Traffic Flow A sequence of packets from a source computer to a 144 destination [RFC6437]. It is the unit of network traffic. 146 Feature (machine learning) In machine learning and pattern 147 recognition, a feature is an individual measurable property of a 148 phenomenon being observed. Choosing informative, discriminating 149 and independent features is a crucial step for effective 150 algorithms in pattern recognition, classification and regression. 152 Algorithm (machine learning) Machine learning algorithms operate by 153 building a model from example inputs in order to make data-driven 154 predictions or decisions expressed as outputs, rather than 155 following strictly static program instructions. An incomplete list 156 of machine learning algorithms includes supervised learning, 157 unsupervised learning, semi-supervised learning, reinforcement 158 learning, deep learning, etc. 160 3. Methodology of Learning from Traffic 162 3.1. Data of the Network Traffic 164 There is plenty of valuable data related to the network traffic. 165 These data are the raw features in the learning process. The following is a 166 simple classification of network traffic data. 168 Measurable properties There are many measurable properties of 169 network traffic, such as latency, number of packets, duration, 170 etc.
These properties are also essential features, 171 especially for use cases relevant to performance, QoS (Quality of 172 Service), etc. 174 Data within communication protocols The user contents are 175 encapsulated in layered communication protocols. Much information 176 is contained within the protocol headers, for example the source 177 and destination IP addresses in the IP header, the port numbers in 178 the TCP/UDP header, etc. Transport layer protocols are often 179 related to the type of applications, such as FTP (File Transfer 180 Protocol) for file transfer, HTTP (Hyper Text Transfer Protocol) 181 for web, etc.; and many application-relevant data are embedded 182 within these protocols. These could also be essential data for 183 classification or application-oriented analysis. However, some 184 traffic will not provide transport or application information, due 185 to unknown protocols or encryption. 187 User content User content is the payload of packets, which might 188 be obtained by DPI (Deep Packet Inspection) within the transit 189 network if the packets are unencrypted, or it could be analyzed 190 by the source or destination nodes. 192 Data in network signaling protocols Traffic flows are managed or 193 indirectly influenced by various network signaling protocols. For 194 example, the routing protocols determine the next hop of a 195 specific network traffic flow, or even the traffic path (with some 196 sophisticated routing protocol such as MPLS-TE (Multi-Protocol 197 Label Switching - Traffic Engineering), segment routing, etc.); 198 the P2P (Peer to Peer) protocol can even decide the destination of 199 a specific content flow. These are relevant and are potential 200 features for traffic analysis. Furthermore, the traffic of these 201 signaling protocols themselves may also be a learning objective. 203 3.2.
Data Source and Storage 205 Within networks, forwarding devices such as routers, switches, 206 firewalls, etc., are the entities that directly handle the network 207 traffic. Thus, they can collect network traffic data, such as 208 measurable properties, protocol information, etc. Source nodes or 209 destination nodes, particularly servers, could also be sources of 210 network traffic data. They can either report the collected data to 211 a central repository for storage and learning, or collect and store 212 the data themselves for local learning. This depends on the 213 learning architecture, which is discussed in the following section. 215 3.3. Architecture Considerations 217 Global learning vs. local learning 219 * Global learning refers to tasks that are mostly network- 220 level, so that they need to be done from a global viewpoint. In 221 this case, the learning entity is normally centralized and is 222 different from the data source entities. 224 * Local learning is more applicable to tasks that are only 225 relevant to one or a limited group of devices, and they can 226 be done directly within that one node or that limited group of 227 nodes. In the case of grouped nodes, the data may also need 228 to be transferred from the data source entity to the learning entity. 230 Offline & online learning 232 * Co-located mode: training (offline, based on historic data) and 233 prediction (online, based on real-time data) are both done 234 within the same entity. The entity could be a central 235 repository or a specific node. 237 * De-coupled mode: training is done in the central repository, 238 and prediction is made by the routers/switches/firewalls or 239 other devices that directly process the network traffic. 241 Central learning & distributed learning Central learning means the 242 learning process is done at a single entity, which is either a 243 central repository or a node.
Distributed learning refers to 244 ensemble learning, in which multiple entities do the learning 245 simultaneously and combine the results to produce a 246 final result. Since network devices are naturally distributed, 247 it can be foreseen that ensemble learning is a good approach for 248 a certain set of use cases. 250 3.4. Closed Control Loop 252 The prediction made by a machine learning mechanism could be directly 253 used to manipulate the network traffic, or to drive other relevant actions, 254 such as changing the device configuration, etc. 256 However, as the introduction section said, this kind of utilization 257 might be suitable only for a small set of the use cases, due to the 258 limited accuracy of machine learning technologies. Besides, some 259 critical usages simply cannot tolerate any false decision. 261 4. Use Cases Study of Applying Machine Learning in Network 263 Editor notes: This section is a collection of the work presented at 264 the proposed NMLRG session at the IETF 95 meeting. More contributions on 265 use cases are welcome. 267 4.1. HTTPS Traffic Classification 269 Managing network traffic requires a good understanding of the content 270 of traffic flows for various purposes. Indeed, enhancing the QoS by 271 prioritizing or scheduling the flows, or enforcing security policies 272 by filtering some of them, cannot rely solely on protocol headers such as the 273 IP, TCP or UDP headers. Analyzing the user content with DPI is therefore 274 necessary. However, this poses serious concerns regarding 275 user privacy. In addition, OTT (Over-the-Top) actors would prefer to 276 fully control their network traffic rather than being subject to any 277 intermediaries' policies. As a result, encrypting the traffic has 278 been widely adopted in recent years. 280 In that context, traffic management is facing severe difficulties 281 since DPI is no longer effective.
Using an intermediary service or 282 proxy is the only way to analyze the content of encrypted traffic, 283 but it requires a high level of trust in the intermediaries, which is not 284 always guaranteed, for example for the end-users of an operator's 285 network. 287 Therefore, new techniques with the ability to extract knowledge and 288 insight from encrypted flows are necessary. In particular, HTTPS 290 [RFC2818] is now a major protocol used over the Internet, because it 291 provides secure Web communication while the Web is now embracing various 292 services that were provided separately in the past: email, video 293 streaming, chat, VoIP, file sharing, etc. It relies on TLS 294 (Transport Layer Security) [RFC5246], [RFC6066] to encapsulate HTTP 295 requests. 297 Being able to identify the service and the provider of an HTTPS 298 connection would help in applying different strategies for managing 299 the corresponding flow. For instance, VoIP (Voice over IP) and email 300 do not require the same QoS, or some service use might be prohibited, 301 like file sharing, to avoid data leakage in a company. 303 As a concrete example, Google, Facebook and Amazon are service 304 providers, while maps, drive and gmail are services of Google. To 305 identify them when they are accessed by a user, identification based on IP addresses and DNS 306 (Domain Name System) names is not reliable, as 307 users can rely on intermediaries to serve as proxies or to 308 resolve DNS requests, respectively. The SNI (Server Name Indication) [RFC6066] 309 is a TLS extension indicated by the client when 310 initiating the TLS handshake (Client Hello). SNI actually contains 311 the hostname to which the request is addressed. Such a hostname is 312 indicative of the service and the service provider name. However, SNI 313 is an optional field and can be easily forged to circumvent HTTPS 314 filtering without impacting service use [bypasssni].
More advanced 315 mechanisms are hence necessary to improve the robustness of 316 identification, even in the case of non-collaborative users. 318 The objective is to automatically label an HTTPS connection 319 with the service and service provider associated with it. The TLS 320 handshake is not encrypted, but the data exchanged during this phase 321 (random number, selected ciphers,...) is not distinctive of the 322 accessed service. However, the nature of the accessed service directly 323 impacts the user content transmitted through the secure channel, 324 especially the type, the size and the way those data are transmitted. Such 325 metadata are still measurable properties. 327 HTTPS Connection 328 + 329 |(1) 330 +-------v------+ 331 |TLS Connection| 332 |Reconstruction| 333 +-------+------+ 334 |(2) 335 +-------v------+ (3') (4') 336 | Features +-------------+----------------------------+ 337 | Extraction | | | 338 +-------+------+ +-------v---------+ +----v----+ 339 | |Service Provider +------------->Services | 340 |(3) |L1 model | Load |L2 model | 341 | +-------^---------+ services +----^----+ 342 +-------v------+ | model X | 343 |SNI Labelling | +----------------------------+ 344 +-------+------+ |(5) 345 | +-----------------------------------------+ 346 +------------> Training and | 347 (4) | Models building | 348 +-----------------------------------------+ 350 Two-levels HTTPS traffic classification 352 In the figure above, step (1) consists of reconstructing the HTTPS 353 connection and retrieving the packets, on top of which the following 354 metrics are observed (2): 356 o Inter Arrival Time 358 o Packet size 360 o Encrypted data size: this feature has the advantage of being strongly 361 related to the accessed service, unlike the packet size, which 362 is biased by lower-layer headers 364 Based on these values, aggregated features are computed: average, 365 minimum, maximum, 25th percentile, median, 75th percentile.
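The aggregation step above can be sketched as follows. This helper is illustrative, not code from the framework; the three per-packet metrics and the six statistics mirror the lists above, while the function and key names are invented.

```python
import statistics

def aggregate(values):
    """Summarize one per-packet metric (inter-arrival time, packet size,
    or encrypted data size) into the six aggregated features listed above."""
    # statistics.quantiles with n=4 returns the 25th/50th/75th cut points.
    q = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "p25": q[0],
        "median": q[1],
        "p75": q[2],
    }

def connection_features(inter_arrival, pkt_sizes, enc_sizes):
    """Flatten the aggregates of all three metrics into one feature vector."""
    feats = {}
    for name, series in (("iat", inter_arrival),
                         ("pkt", pkt_sizes),
                         ("enc", enc_sizes)):
        for key, value in aggregate(series).items():
            feats[f"{name}_{key}"] = value
    return feats
```

The resulting 18-entry vector (3 metrics x 6 statistics) is what a per-connection classifier would consume.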
367 Because different providers may offer a similar service, a single 368 classifier could fail to distinguish them. A multi-level machine 369 learning approach has been proposed. For learning, a dataset without 370 forged SNI is used (3) to build the classifiers (4). The result is 371 (5): 373 o a first-level model (L1 model), whose goal is to identify the 374 service provider, 376 o a set of second-level models (L2 models), one for each service 377 provider, to identify the specific service of that service provider 379 Once all classifiers are trained, a new unknown HTTPS connection is 380 first matched against the L1 model (3'). The output is the 381 predicted service provider, but it also leads to loading the corresponding 382 L2 model (4') to determine the specific service of this service 383 provider. 385 This framework is independent of the ML technique being used. Each 386 model could also be built with a different technique, but our study 387 has shown that the best results are obtained with Random Forest. 389 The HTTPS classification framework has been tested over 288,901 390 connections from lab users. Standard evaluation procedures have been 391 applied. Less representative features have been automatically 392 discarded. Using a ten-fold cross-validation, each tested connection 393 has been marked as a perfect identification (both the service provider 394 and the service name are rightly identified), a partial identification 395 (only the service provider is identified) or invalid (neither of them). 396 93.1% fall in the first category, 2.9% in the second and the rest in 397 the third. Full results are available in [httpsframework]. 399 Although the results are promising, the current method can only be 400 applied once the HTTPS connection has ended, i.e. after it has been 401 reconstructed. This prevents applying any kind of policy to the 402 corresponding traffic flow. A future challenge is thus to classify the 403 connection before it ends, so that policies can be applied in time. 405 4.2.
Malicious Domains: Automatic Detection with DNS Traffic Analysis 407 Since their inception, domain names have been used to provide a 408 simple identification label for hosts, services, applications, and 409 networks on the Internet [RFC1034]. In the same way, domains and the 410 DNS infrastructure have also been misused in various types of abuses, 411 such as phishing, spam, and malware distribution, among others. 413 Newly registered malicious domain names are well known to exhibit a very 414 distinct initial DNS lookup pattern compared to legitimate ones: typically, 415 they exhibit an abnormally high number of lookups [Hao2011]. One 416 of the reasons is that malicious domains tend to rely upon spam 417 campaigns within the first hours after the registration of these 418 domains, in order to maximize the number of victims before the domain 419 is detected and taken down. 421 In order to protect users from such domains, nDEWS (New Domains Early 422 Warning System) [Moura2016], a tool that classifies newly 423 registered domains based on their initial lookup pattern, has been 424 proposed. This requires access to (i) a 425 domain registration database and (ii) authoritative DNS server 426 traffic data, which is typically the case for Top-Level Domain (TLD) 427 registries. These domains are clustered into two clusters 428 using k-means, based on four features extracted 429 from the analyzed DNS traffic: # DNS queries, # IP addresses, # 430 Autonomous Systems (ASes), and # Countries, which were chosen 431 empirically. 433 As a result, in an automated fashion, a large variety of suspicious 434 domains can be detected, including phishing and malware, but also other 435 types, such as fake pharmaceutical shops as well as counterfeit- 436 sneaker shops. In this pilot study, the responsible registrars are 437 notified about these websites.
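The two-cluster step above can be sketched with a minimal, hand-rolled k-means (k=2); the four features per domain mirror the list above, but the sample values and the centroid seeding are invented for illustration and are not part of nDEWS.

```python
import math

def kmeans2(points, iters=20):
    """Plain k-means with k=2. Each point is a feature vector of
    (# DNS queries, # IP addresses, # ASes, # countries) for one
    newly registered domain."""
    # Illustrative seeding: the lexicographically smallest and largest points.
    cents = [min(points), max(points)]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            dists = [math.dist(p, c) for c in cents]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        cents = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
                 for cl, c in zip(clusters, cents)]
    return cents, clusters

domains = [
    (12, 5, 3, 2),       # few lookups: typical benign registration
    (15, 6, 4, 2),
    (900, 400, 80, 40),  # lookup burst from many ASes: suspicious pattern
    (950, 420, 85, 42),
]
cents, clusters = kmeans2(domains)
```

Domains falling into the high-lookup cluster would be the candidates flagged for manual review or registrar notification.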
Ultimately, it 438 allows these websites to be taken down, minimizing the potential 439 number of victims. 441 4.3. Machine-learning based Policy Derivation and Evaluation in 442 Broadband Networks 444 Service provisioning is becoming more complex. For instance, there 445 are services with diverse quality requirements, there is variance 446 of the requirements in time and space, and there is the need for 447 utmost resource efficiency. Moreover, full agility in time and space 448 (in order to accomplish resource-efficient service provisioning) 449 requires the solution of computationally intensive tasks. In this 450 respect, policies can play a role: specifying the network behaviour in 451 time periods and service-area regions. 453 In this direction, machine learning can have a fundamental role, 454 e.g., for learning the situations encountered and "good" ways (policies) 455 of handling them. This contribution addresses the role that machine 456 learning can play in policy derivation and evaluation. In more 457 detail, it addresses the requirements on the role of machine learning, 458 including potential inputs and outputs. 460 Knowledge and machine learning can be an important aspect of wireless 461 networks. Knowledge is created both regarding the contexts and their 462 occurrence, as well as on the association of each context with 463 specific actions and their scoring. The latter encompasses the development 464 of knowledge on how to handle acquired contexts; this knowledge will 465 include the contexts encountered, the corresponding handlings done 466 (decisions applied), the potential alternative handlings, and the 467 respective efficiency of each handling (actually applied or 468 alternate). 470 Reinforcing "good" solutions for each encountered context (e.g. 471 reinforcement learning) can be a vital and unique element of a 472 knowledge-based management system.
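One way to picture the "reinforce good handlings per context" idea above is a simple action-value table that keeps a running average efficiency score per (context, action) pair; this is a minimal sketch, and the context and action names are invented for illustration.

```python
from collections import defaultdict

class PolicyScores:
    """Track the average observed efficiency of each (context, action)
    pair, so the best-scoring handling can be reused the next time the
    same context is encountered."""

    def __init__(self):
        self.score = defaultdict(float)  # (context, action) -> mean efficiency
        self.count = defaultdict(int)

    def record(self, context, action, efficiency):
        key = (context, action)
        self.count[key] += 1
        # Incremental mean: new_avg = avg + (x - avg) / n
        self.score[key] += (efficiency - self.score[key]) / self.count[key]

    def best(self, context, actions):
        """Return the candidate handling with the highest score so far."""
        return max(actions, key=lambda a: self.score[(context, a)])

ps = PolicyScores()
# Hypothetical context (cell congestion) with two candidate policies.
ps.record("cell_congested", "reroute", 0.9)
ps.record("cell_congested", "throttle", 0.6)
ps.record("cell_congested", "reroute", 0.7)
```

A full reinforcement learning agent would also handle exploration and delayed rewards; the table above only captures the scoring of applied and alternative handlings described in the text.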
Machine learning can be realized 473 through clustering to discover underlying structures in data, 474 regression to identify patterns and predict values in cell and 475 network usage, classification to classify first-seen unknown users, 476 and density estimation to model complex user behavior and network 477 usage. Several deep architectures and techniques (such as pre- 478 training) can be utilized, in order to generalize better on complex 479 data with underlying information and to be able to make accurate 480 predictions, even on unseen data. 482 As a result, depending on what we want to achieve, the proper machine 483 learning approach can be used. 485 Through machine learning it will be possible to provide faster and 486 targeted solutions to specific network problems. Moreover, it is 487 possible to cluster various usage profiles and prioritize the traffic 488 according to the criticality level. For instance, mission-critical 489 services need special attention with respect to latency and 490 prioritization, compared to plain services which may tolerate a bit 491 of delay without jeopardizing the overall quality. In addition, 492 machine learning can lead to improved results in KPIs (Key 493 Performance Indicators) such as end-user throughput, latency, energy 494 consumption and overall cost effectiveness. Moreover, reliability 495 can be increased since certain problematic situations may be 496 predicted before they happen; hence it will be possible to act pro- 497 actively and alleviate the negative impact of a problem in the 498 network. 500 It is evident that machine learning can have significant importance 501 in policy derivation and evaluation in broadband networks, especially 502 in 5G infrastructures, which will be complex and heterogeneous 503 and will need to accommodate multiple services ranging from mobile broadband 504 to massive machine-type, mission-critical and vehicular 505 communications. 507 4.4.
Traffic Anomaly Detection in the Router 509 Modern routers usually have the capability to raise alarms when the 510 bandwidth usage rate of a specific interface is high. When network traffic 511 exceeds a certain threshold, the router will consider it an 512 anomaly event and report it to the NMS (Network Management System). 513 For instance, in some routers/switches, there exists configuration 514 such as "trap-threshold { input-rate | output-rate }" to trigger 515 traffic alarms, which is statically configured by experienced 516 administrators. However, network traffic is usually not static and 517 may even change significantly due to changes in the carried services, 518 the residential situation, etc. Thus, a static configuration cannot 519 effectively identify traffic anomaly events. 521 To address the above issue, machine learning technologies are applied in 522 routers/switches to learn the local traffic pattern and detect 523 traffic anomaly events based on the learning results. 525 Wavelets are employed to analyze time-series network traffic for 526 anomaly detection. At a certain interval, the routers measure, 527 record, and analyze the input and output traffic rates respectively, 528 or in the form of rate sums. (The former is recommended for a finer- 529 granularity analysis.) 531 After running for some time, the router would get a set of "time-rate" 532 data, collected as time-series waves for further wavelet analysis. 533 Besides wavelets, this use case proposes other machine learning 534 techniques such as outlier detection. In that case, features are to 535 be extracted from the wavelets for supervised or unsupervised learning. 537 After data collection, the router would sort the data and derive 538 the alarm threshold statistically, based on the data distribution, to 539 discriminate between normal and outlier traffic rates. When interface 540 traffic exceeds the threshold, the router would raise alarms to the 541 NMS.
The router could dynamically adjust the alarm threshold with 542 newly arriving data, by periodic anomaly analysis. This approach helps 543 devices detect traffic anomalies more efficiently and effectively, 544 compared to the traditional way of learning at a central repository 545 that collects traffic information from various devices. 547 This use case could be extended from a single interface to multiple 548 ones, that is, a device scope covering multiple traffic waves, and even a wider 549 scope of multiple devices in a certain domain. This would make the 550 analysis more comprehensive. 552 Besides wavelet analysis, there might be more techniques to explore, 553 such as correlation analysis of traffic anomaly events among multiple 554 devices. 556 4.5. Applications of Machine Learning to Flow Monitoring 558 A commercial cloud-based flow monitoring service from Network 559 Polygraph [polygraph] has used Machine Learning analysis as a cost- 560 effective alternative to DPI for traffic classification, which 561 identifies the application responsible for each network traffic flow. 563 Nowadays, DPI is considered the standard technology for traffic 564 classification. However, DPI is generally expensive, as it requires 565 the analysis of the payload of every single packet. This usually 566 involves the use of powerful, specialized hardware appliances, which 567 need to be deployed on every link to obtain full coverage of the 568 network. In the case of Network Polygraph, the use of DPI is 569 impractical, because the volume of data to be exported to the cloud 570 would be overwhelming (i.e., all traffic would have to be replicated). A 571 more viable alternative is the use of flow-based monitoring 572 technologies, such as NetFlow [RFC3954] or IPFIX [RFC7011], where the 573 volume of exported data is significantly lower.
Flow-based 574 monitoring technologies provide summarized information (e.g., 575 duration, traffic volume) for every connection (or "traffic flow") 576 handled by a router. The information available in flow records is 577 more limited compared to DPI (e.g., packet payloads are not 578 available). As a result, most flow-based monitoring tools base their 579 classification on port numbers or simple heuristics, which are 580 known to be highly unreliable. 582 To address this problem, Network Polygraph uses a traffic 583 classification approach based on ML. Several studies showed that 584 supervised learning can achieve similar classification accuracy to 585 DPI at a fraction of its cost. However, supervised methods suffer 586 from some practical limitations that make them very difficult to 587 deploy and maintain in production environments. For example, they 588 require a costly training phase prior to their deployment and need to 589 be retrained frequently, every time there is a change in the network 590 or in the network applications. 592 This section describes the ML approach used by Network Polygraph for 593 online classification of NetFlow/IPFIX traffic. To overcome the 594 practical limitations of supervised learning, Network Polygraph 595 incorporates an automatic retraining system. Figure 1 shows the 596 components and data flow of the classification engine, which is 597 divided into two parts: 599 o The classification path (Figure 1, top) is in charge of 600 classifying the traffic online using ML. The inputs of the 601 classification path are the NetFlow/IPFIX flows exported by the 602 routers, while the outputs are the classified flows. Several 603 traffic features are extracted from each flow, including the 604 information directly available in the flow records (e.g., 605 addresses, ports, packet and byte counts) together with some 606 constructed features (e.g., average packet size, rate and 607 interarrival time).
The traffic features are the input of the 608 traffic classification algorithm, whose function is to identify 609 the application that generated the flow. Among the different 610 supervised algorithms, a C5.0 decision tree was selected because 611 it has been shown to present the best accuracy/cost ratio for 612 traffic classification. Other supervised methods, e.g., Support 613 Vector Machines (SVM) and Artificial Neural Networks (ANN), obtain 614 similar accuracy, but classification and training times are faster 615 with decision trees. In Network Polygraph, training times are 616 critical because the training path continuously updates the 617 classification model in the background. 619 o The training path (Figure 1, bottom) implements the automatic 620 retraining system, which is responsible for automatically updating 621 the classification model when it becomes obsolete. To that end, a 622 random packet-level sample of the network traffic is continuously 623 collected using flow-based sampling. Sampled flows are then 624 labeled using DPI. It is possible to use DPI in the training path 625 because training can be performed with only a small data sample 626 (e.g., 1/1000 flows). This significantly reduces the 627 computational overhead and the volume of data to be exported. The 628 labeled sample is used to verify the accuracy of the 629 classification model. The system accuracy is estimated by 630 comparing the output of DPI (training path) and C5.0 631 (classification path) for those flows sampled in the training 632 path. If the estimated accuracy falls below a configurable 633 threshold, the labeled sample is used to generate an updated model 634 using only those features available in NetFlow/IPFIX (IP Flow 635 Information Export) records. This training process can also be 636 performed at a few vantage points, and the resulting model reused 637 for other networks where only NetFlow/IPFIX monitoring data is available.
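The interplay of the two paths can be sketched as follows. This is an illustrative sketch, not Network Polygraph code: a toy port-lookup model stands in for the C5.0 decision tree, `dpi_label` stands in for the DPI labeling engine, and the simplified flow-record fields are assumptions.

```python
import random

RETRAIN_THRESHOLD = 0.96  # configurable accuracy threshold

def features(flow):
    # Fields directly available in flow records plus constructed
    # features such as the average packet size; the real C5.0 tree
    # consumes a feature vector like this one.
    return (flow["dst_port"], flow["packets"],
            flow["bytes"] / max(flow["packets"], 1))

class PortModel:
    """Toy stand-in for the C5.0 decision tree: memorizes the
    destination-port -> application mapping seen at training time."""

    def __init__(self, labeled_flows):
        self.by_port = {f["dst_port"]: app for f, app in labeled_flows}

    def classify(self, flow):
        return self.by_port.get(flow["dst_port"], "unknown")

def process_batch(model, flows, dpi_label, sample_rate=0.001):
    """One round of the engine.  Classification path: label every flow
    with the current model.  Training path: DPI-label a small random
    flow sample, estimate accuracy by comparing both outputs, and
    retrain when the estimate falls below the threshold."""
    classified = [(f, model.classify(f)) for f in flows]

    sample = [f for f in flows if random.random() < sample_rate]
    labeled = [(f, dpi_label(f)) for f in sample]
    if labeled:
        agree = sum(model.classify(f) == app for f, app in labeled)
        if agree / len(labeled) < RETRAIN_THRESHOLD:
            model = PortModel(labeled)  # updated model from the sample
    return classified, model
```

The key design point reproduced here is that the expensive labeler (DPI) only ever sees the small sampled stream, while the cheap model classifies everything and is refreshed automatically, without human intervention, whenever its estimated accuracy degrades.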
639 CLASSIFICATION PATH 641 NetFlow/ +----------+ +----------+ Classified 642 IPFIX | Feature | | C5.0 | flows 643 +-------->|Extraction+------------------------>|Classifier+-----------> 644 | | | | 645 +----------+ +----------+ 646 ^ 647 | 648 TRAINING PATH +----------+ +----------+ | 649 | NetFlow/ | | Feature | | Retraining 650 +-->| IPFIX +-->|Extraction+--+ | 651 Packet stream | |Generation| | | | | 652 (flow sampling) | +----------+ +----------+ | | 653 +--------------->| +--+ DPI-labeled 654 | +----------+ | NetFlow/ 655 | | DPI | | IPFIX 656 +---------->| App. +---------+ 657 | Labeling | 658 +----------+ 660 Network Polygraph classification engine data flow 662 Figure 1 664 In order to validate the performance of the described ML approach, 665 the accuracy of Network Polygraph was measured using a complete 666 14-day trace from the 10-Gigabit link that connects the Catalan 667 Research and Education Network (Anella Cientifica) to its Spanish 668 counterpart (RedIRIS). The trace contained about 70 million flows 669 with a flow sampling rate of 1/400. The experimental results showed 670 that, with a 96% retraining threshold, the system sustained an 671 average classification accuracy of 97.5%, needing only 15 retrainings 672 during the 14 days, which were performed automatically without 673 requiring any human intervention. When the retraining threshold was 674 decreased to 94%, the accuracy was slightly reduced to 96.76% with 675 only 5 retrainings. 677 The target objective is to progressively reduce the dependence on DPI 678 technologies, which are expensive, difficult to deploy, not scalable, 679 and not robust against encryption, in favor of flow-based machine 680 learning approaches that are more cost-effective and can be easily 681 offered as a cloud service. 
In this direction, some research 682 challenges include the classification of web services and CDN traffic 683 from flow-based measurements, and the combination of multiple ground 684 truths obtained from vantage points in different networks. 686 5. Security Considerations 688 This document focuses on applying machine learning to networks at 689 the level of higher-layer concepts, including applying machine 690 learning to network security. Therefore, it does not itself create 691 any new security issues. 693 6. IANA Considerations 695 This memo includes no request to IANA. 697 7. Acknowledgements 699 The authors would like to acknowledge Josep Sanjuas, Andreas 700 Georgakopoulos, Kostas Tsagkaris, Valentin Carela, Wazen M. Shbair, 701 Thibault Cholez, and Isabelle Chrisment for their contributions. 703 The authors would also like to acknowledge the valuable comments made 704 by participants in the IRTF Network Machine Learning Research Group, 705 with particular thanks to Lars Eggert, Brian Carpenter, Albert 706 Cabellos, Shufan Ji, Susan Hares, Rudra Saha, and Dacheng Zhang. 708 Jerome Francois was partly funded by Flamingo, a Network of 709 Excellence project (ICT-318488) supported by the European Commission 710 under its 7th Framework Programme. 712 This document was produced using the xml2rfc tool [RFC7749]. 714 8. Change log [RFC Editor: Please remove] 716 draft-jiang-nmlrg-traffic-machine-learning-00: original version, 717 2016-06-03. 719 9. Informative References 721 [bypasssni] 722 Shbair, W., Cholez, T., Goichot, A., and I. Chrisment, 723 "Efficiently Bypassing SNI-based HTTPS Filtering", IFIP/ 724 IEEE International Symposium on Integrated Network 725 Management (IM2015) , 2015. 727 [Hao2011] Hao, S., Feamster, N., and R. Pandrangi, "Monitoring the 728 Initial DNS Behavior of Malicious Domains", Proceedings of 729 the 2011 ACM SIGCOMM Conference on Internet Measurement 730 Conference (IMC 2011) , Nov 2011.
732 [httpsframework] 733 Shbair, W., Cholez, T., Francois, J., and I. Chrisment, "A 734 Multi-Level Framework to Identify HTTPS Services", IEEE/ 735 IFIP Network Operations and Management Symposium , 2016. 737 [Moura2016] 738 M. Moura, G., Mueller, M., Wullink, M., and C. Hesselman, 739 "nDEWS: a New Domains Early Warning System for TLDs", 740 IEEE/IFIP International Workshop on Analytics for Network 741 and Service Management (AnNet 2016), co-located with IEEE/ 742 IFIP Network Operations and Management Symposium (NOMS 743 2016) , 04 2016. 745 [polygraph] 746 "Network Polygraph", . 748 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", 749 STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987, 750 . 752 [RFC2818] Rescorla, E., "HTTP Over TLS", RFC 2818, 753 DOI 10.17487/RFC2818, May 2000, 754 . 756 [RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export 757 Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004, 758 . 760 [RFC5246] Dierks, T. and E. Rescorla, "The Transport Layer Security 761 (TLS) Protocol Version 1.2", RFC 5246, 762 DOI 10.17487/RFC5246, August 2008, 763 . 765 [RFC6066] Eastlake 3rd, D., "Transport Layer Security (TLS) 766 Extensions: Extension Definitions", RFC 6066, 767 DOI 10.17487/RFC6066, January 2011, 768 . 770 [RFC6437] Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme, 771 "IPv6 Flow Label Specification", RFC 6437, 772 DOI 10.17487/RFC6437, November 2011, 773 . 775 [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, 776 "Specification of the IP Flow Information Export (IPFIX) 777 Protocol for the Exchange of Flow Information", STD 77, 778 RFC 7011, DOI 10.17487/RFC7011, September 2013, 779 . 781 [RFC7749] Reschke, J., "The "xml2rfc" Version 2 Vocabulary", 782 RFC 7749, DOI 10.17487/RFC7749, February 2016, 783 . 785 Authors' Addresses 787 Sheng Jiang (editor) 788 Huawei Technologies Co., Ltd 789 Q 22, Huawei Campus, No.156 Beiqing Road 790 Hai-Dian District, Beijing, 100095 791 P.R. 
China 793 Email: jiangsheng@huawei.com 795 Bing Liu 796 Huawei Technologies Co., Ltd 797 Q 22, Huawei Campus, No.156 Beiqing Road 798 Hai-Dian District, Beijing, 100095 799 P.R. China 801 Email: leo.liubing@huawei.com 802 Panagiotis Demestichas 803 University of Piraeus 804 Piraeus 805 Greece 807 Email: pdemestichas@gmail.com 809 Jerome Francois 810 Inria 811 615 rue du jardin botanique 812 54600 Villers-les-Nancy 813 France 815 Email: jerome.francois@inria.fr 817 Giovane C. M. Moura 818 SIDN Labs 819 Meander 501 820 Arnhem, 6825 MD 821 The Netherlands 823 Email: giovane.moura@sidn.nl 825 Pere Barlet 826 Network Polygraph 827 Edifici K2M - Parc UPC 828 Jordi Girona, 1-3, Barcelona 08034 829 Spain 831 Email: pbarlet@polygraph.io