NFVRG                                                      Bose Perumal
Internet-Draft                                              Wenjing Chu
Intended Status: Informational                              R. Krishnan
                                                           Hemalathaa S
                                                                    Dell
Expires: December 3, 2015                                  June 29, 2015


             NFV Compute Acceleration Evaluation and APIs
           draft-perumal-nfvrg-nfv-compute-acceleration-00

Abstract

Network functions are being virtualized and moved to industry
standard servers.  Steady growth in traffic volume requires more
compute power to process network functions.  Packet-based network
processing offers considerable scope for parallel processing.
Generic parallel processing can be carried out on common multicore
platforms such as GPUs, coprocessors such as the Intel Xeon Phi, and
Intel/AMD multicore CPUs.  To check the feasibility of exploiting
this parallel processing capability, this draft takes multi-string
matching for URL filtering as the sample network function.  The
Aho-Corasick algorithm is used for multi-pattern matching.  The
implementation uses OpenCL in order to support many common
platforms.  A set of optimizations is applied, and the application
is tested on Nvidia Tesla K10 GPUs.  A common API for NFV Compute
Acceleration is proposed.

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with
the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time.  It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

Copyright and License Notice

Copyright (c) 2015 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document.  Code Components extracted from this
document must include Simplified BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Simplified BSD License.

Table of Contents

   1  Introduction . . . . . . . . . . . . . . . . . . . . . . . .  4
     1.1  Terminology . . . . . . . . . . . . . . . . . . . . . .  5
   2. OpenCL based Virtual Network Function Architecture . . . . .  6
     2.1  CPU Process . . . . . . . . . . . . . . . . . . . . . .  7
     2.2  Device Discovery  . . . . . . . . . . . . . . . . . . .  7
     2.3  Mixed Version Support . . . . . . . . . . . . . . . . .  7
     2.4  Scheduler . . . . . . . . . . . . . . . . . . . . . . .  8
   3. Aho-Corasick Algorithm  . . . . . . . . . . . . . . . . . .  9
   4. Optimizations . . . . . . . . . . . . . . . . . . . . . . .  9
     4.1  Variable size packet packing  . . . . . . . . . . . . .  9
     4.2  Pinned Memory . . . . . . . . . . . . . . . . . . . . . 10
     4.3  Pipelined Scheduler . . . . . . . . . . . . . . . . . . 10
     4.4  Reduce Global memory access . . . . . . . . . . . . . . 10
     4.5  Organizing GPU cores  . . . . . . . . . . . . . . . . . 10
   5. Results . . . . . . . . . . . . . . . . . . . . . . . . . . 11
     5.1  Worst case 0 string match . . . . . . . . . . . . . . . 11
     5.2  Packet match  . . . . . . . . . . . . . . . . . . . . . 12
   6. Compute Acceleration API  . . . . . . . . . . . . . . . . . 13
     6.1  Add Network Function  . . . . . . . . . . . . . . . . . 13
     6.2  Add Traffic Stream  . . . . . . . . . . . . . . . . . . 14
     6.3  Add Packets to Buffer . . . . . . . . . . . . . . . . . 16
     6.4  Process Packets . . . . . . . . . . . . . . . . . . . . 16
     6.5  Event Notification  . . . . . . . . . . . . . . . . . . 16
     6.6  Read Results  . . . . . . . . . . . . . . . . . . . . . 17
   7. Other Accelerators  . . . . . . . . . . . . . . . . . . . . 17
   8. Conclusion  . . . . . . . . . . . . . . . . . . . . . . . . 17
   9. Future Work . . . . . . . . . . . . . . . . . . . . . . . . 18
   10 Security Considerations . . . . . . . . . . . . . . . . . . 18
   11 IANA Considerations . . . . . . . . . . . . . . . . . . . . 18
   12 References  . . . . . . . . . . . . . . . . . . . . . . . . 18
     12.1  Normative References . . . . . . . . . . . . . . . . . 18
     12.2  Informative References . . . . . . . . . . . . . . . . 18
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . 19
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . 19

1 Introduction

Network equipment vendors use specialized hardware to process data
at low latency and high throughput.  Packet processing above 4 Gb/s
is done using expensive, purpose-built application-specific
integrated circuits.  However, the low unit volumes force
manufacturers to price these devices at many times the cost of
producing them, to recover the R&D cost.

Network Function Virtualization (NFV) [1] is a key emerging area for
network operators, hardware and software vendors, cloud service
providers, and in general network practitioners and researchers.
NFV introduces virtualization technologies into the core network to
create a more intelligent, more agile service infrastructure.
Network functions that are traditionally implemented in dedicated
hardware appliances will need to be decomposed and executed in
virtual machines running in data centers.  The parallelism of
graphics processors gives them the potential to function as network
coprocessors.

A virtual network function is responsible for a specific treatment
of received packets and can act at various layers of a protocol
stack.  When more compute power is available, multiple virtual
network functions can be executed in a single system or VM.  When
multiple virtual network functions are processed in one system, some
of them could be processed in parallel with other network functions.
This draft proposes a method to represent an ordered set of virtual
network functions as a combination of sequential and parallel
stages.  This draft is about software-based network functions, so
any further reference to a network function means a virtual network
function.

Software written for specialized hardware like network processors,
ASICs and FPGAs is closely tied to the hardware and to specific
vendor products, and cannot be reused on other hardware platforms.
For generic compute acceleration, different hardware platforms can
be used: GPUs from different vendors, Intel Xeon Phi coprocessors,
and multicore CPUs from different vendors.  All of these compute
acceleration platforms support OpenCL as a parallel programming
language.  Instead of every vendor writing its own OpenCL code, this
draft proposes an NFV Compute Acceleration (NCA) API as a common
compute acceleration layer.  The API will be a library with C API
functions for declaring the network functions as an ordered set and
for moving packets.

Multi-pattern string matching is used in a number of applications,
including network intrusion detection and digital forensics; hence
multi-pattern matching is chosen as the sample network function.
The Aho-Corasick [2] algorithm, with a few modifications, is used to
find the first occurrence of any pattern from the signature
database.  Based on this network function, throughput numbers are
measured.

1.1 Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

2. OpenCL based Virtual Network Function Architecture

Network functions like multi-pattern matching are compute intensive
and common to multiple NFV applications.
In this prototype, generic compute acceleration framework functions
and service-specific functions are clearly separated.  The
architecture with one network function is shown in Figure 1;
multiple network functions can also be loaded.  Most signature-based
algorithms, such as Aho-Corasick [2] and regular expression matching
[8], generate a Deterministic Finite Automaton (DFA) [2][8].  The
DFA database is generated on the CPU and loaded to the accelerator.
Kernels executed on the accelerator use the DFA.

   +------------------------------------------+  +-------------------+
   |                CPU Process               |  | GPU/Xeon Phi,etc. |
   |                                          |  |                   |
   |  Scheduler                               |  |                   |
   |  +------------+   +------------+         |  |  +------------+   |
   |  | Packet     |   |Copy Packet |         |  |  |Input Buffer|   |
   |  |Generator   +-->|CPU to GPU  +-------------->|P1,P2,...,Pn|   |
   |  +------------+   +------------+         |  |  +-----+------+   |
   |                                          |  |        |          |
   |  +------------+                          |  |  +-----v------+   |
   |  |Launch GPU  |                          |  |  |GPU Kernels |   |
   |  |Kernels     +------------------------------->|K1,K2,...,Kn|<+ |
   |  +------------+                          |  |  +-----+------+ | |
   |                                          |  |        |        | |
   |  +-------------+   +------------+        |  |  +-----v------+ | |
   |  |Results for  |   |Copy Results|        |  |  |Result Buf  | | |
   |  |each packet  |<--+GPU to CPU  |<-------------+R1,R2,...,Rn| | |
   |  +-------------+   +------------+        |  |  +------------+ | |
   |                                          |  |                 | |
   |  +----------------+   +-----------+      |  |  +------------+ | |
   |  |Network Function|   | NF        |      |  |  | NF         | | |
   |  |(AC,Regex,etc)  +-->| Database  +----------->| Database   +-+ |
   |  +-------+--------+   +-----------+      |  |  +------------+   |
   |          ^                               |  |                   |
   +----------|-------------------------------+  +-------------------+
              |
        +-----+-----+
        | Signature |
        | Database  |
        +-----------+

          Figure 1. OpenCL based Virtual Network Function
                    Software Architecture Diagram

2.1 CPU Process

Accelerators like GPUs or coprocessors augment the CPU; currently
they cannot function alone.  A virtual network function is therefore
split between the CPU and the GPU.  The CPU process owns packet
preprocessing, packet movement and scheduling, while the GPU
performs the core functionality of the network functions.  The CPU
process interfaces between the packet I/O and the GPU.  During
initialization it performs the following steps.

   1. Device discovery
   2. Initialize the OpenCL object model
   3. Initialize the memory module
   4. Initialize network functions
   5. Trigger the scheduler

2.2 Device Discovery

Using OpenCL functions, the device discovery module discovers the
platforms and devices.  Based on the number of devices discovered,
device contexts and command queues are created.

2.3 Mixed Version Support

OpenCL is designed to support devices with different capabilities
under a single platform [3].  There are three version identifiers in
OpenCL: the platform version, the version of a device, and the
version(s) of the OpenCL C language supported on a device.

The platform version indicates the version of the OpenCL runtime
supported.  The device version is an indication of the device's
capabilities.  The language version for a device represents the
OpenCL programming language features a developer can assume are
supported on a given device.

OpenCL C is designed to be backwards compatible, so a device is not
required to support more than a single language version to be
considered conformant.  If multiple language versions are supported,
the compiler defaults to the highest language version supported for
the device.

Code written for an older device version may not utilize the full
capabilities of a newer device if there are hardware architectural
changes.
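
The following non-normative sketch illustrates the discovery and
version query steps of Sections 2.2 and 2.3 using standard OpenCL
1.2 host calls [3].  The function name, array sizes, use of printf
and omission of error handling are simplifications; in the prototype
the contexts and command queues are handed to the scheduler
(Section 2.4).

   #include <CL/cl.h>
   #include <stdio.h>

   void nca_discover_devices(void)
   {
       cl_platform_id   plat[4];
       cl_device_id     dev[16];
       cl_uint          n_plat = 0, n_dev = 0;
       char             ver[128];
       cl_int           err;

       clGetPlatformIDs(4, plat, &n_plat);
       for (cl_uint p = 0; p < n_plat; p++) {
           clGetPlatformInfo(plat[p], CL_PLATFORM_VERSION,
                             sizeof(ver), ver, NULL);
           printf("platform %u: %s\n", p, ver);

           clGetDeviceIDs(plat[p], CL_DEVICE_TYPE_ALL, 16, dev, &n_dev);
           for (cl_uint d = 0; d < n_dev; d++) {
               /* the language version decides which kernel features
                  can be assumed on this device (Section 2.3) */
               clGetDeviceInfo(dev[d], CL_DEVICE_OPENCL_C_VERSION,
                               sizeof(ver), ver, NULL);
               printf("  device %u OpenCL C: %s\n", d, ver);

               /* one context and command queue per device; in the
                  prototype these are kept by the scheduler */
               cl_context ctx = clCreateContext(NULL, 1, &dev[d],
                                                NULL, NULL, &err);
               cl_command_queue q = clCreateCommandQueue(ctx, dev[d],
                                                         0, &err);
               (void)q;
           }
       }
   }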

2.4 Scheduler

Scheduling between the packet buffers coming from the network I/O
and the device command queues is carried out by the scheduler.  The
scheduler operates on the following parameters.

   N - Number of packet buffers (default 6)
   M - Number of packets in each buffer (default 16384)
   K - Number of devices (2 discovered)
   J - Number of command queues for each device (default 3)
   I - Number of commands to the device to complete a single
       network function (default 3)
   S - Number of network functions executed in parallel (default 1)

The default values mentioned above gave the best results in our
current hardware environment with the multi-string match function.

Operations for completing a network function for one packet buffer:

   1. Identify a free command queue
   2. Copy packets from I/O memory to pinned memory for the GPU
   3. Fill kernel function parameters
   4. Copy pinned memory to GPU global memory
   5. Launch kernels for the number of packets in the packet buffer
   6. Check kernel execution completion and collect results
   7. Report results to the application

The scheduler calls the OpenCL API with the number of kernels to be
executed in parallel.  Distributing the kernels to cores is taken
care of by the OpenCL library.  If there is any error while
launching the kernels, the OpenCL API returns an error and
appropriate error handling can be done.

3. Aho-Corasick Algorithm

The Aho-Corasick algorithm [2] is one of the most effective
multi-pattern matching algorithms.  It is a dictionary-matching
algorithm that locates elements of a finite set of strings within an
input text.  The complexity of the algorithm is linear in the length
of the patterns plus the length of the searched text plus the number
of output matches.

The algorithm works in two parts.  The first part builds a tree
(state machine) from the keywords to be searched for, and the second
part searches the text for the keywords using the previously built
state machine.  Searching for a keyword is efficient because it only
moves through the states of the state machine.  If a character
matches, the goto() function is followed; otherwise the fail()
function is followed.  A match is reported by the out() function.

All three functions simply access indexed data structures and return
a value.  The goto() data structure is a two-dimensional matrix
indexed by the current state and the character currently being
compared.  The fail() function is an array that links each state to
its alternate path.  The out() function is an array over states that
records whether a string search completes in a particular state.

Based on the signature database, all three data structures are
constructed on the CPU.  They are copied to GPU global memory during
the initialization stage, and pointers to them are passed as kernel
parameters when the kernels are launched.
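
A non-normative OpenCL C sketch of the search kernel described above
is shown below.  The flattened table layout (goto indexed by
state*256+character, -1 for a missing transition), the result
encoding and the packet layout of Section 4.1 are illustrative
assumptions; the draft does not mandate a particular representation.
For clarity the sketch reads the packet byte by byte, whereas
Section 4.4 describes the vectorized access used in the prototype.

   /* One work-item searches one packet against the DFA tables.
      Assumed layout:
        goto_tbl[state * 256 + ch] : next state, or -1 if none
        fail_tbl[state]            : alternate path for the state
        out_tbl[state]             : non-zero if a pattern ends here
      Packets use the Section 4.1 layout: pkt_offsets[i] points at a
      4-byte length followed by the packet bytes (4-byte aligned). */
   __kernel void ac_match(__global const uchar *pkt_buf,
                          __global const int   *pkt_offsets,
                          __global const int   *goto_tbl,
                          __global const int   *fail_tbl,
                          __global const int   *out_tbl,
                          __global int         *results)
   {
       int gid = get_global_id(0);
       int off = pkt_offsets[gid];
       int len = *((__global const int *)(pkt_buf + off));
       __global const uchar *data = pkt_buf + off + 4;

       int state = 0;
       results[gid] = -1;                    /* -1: no pattern matched */

       for (int i = 34; i < len; i++) {      /* skip Ethernet+IP hdrs  */
           uchar c = data[i];
           while (state != 0 && goto_tbl[state * 256 + c] == -1)
               state = fail_tbl[state];      /* follow fail()          */
           int next = goto_tbl[state * 256 + c];
           if (next != -1)
               state = next;
           if (out_tbl[state] != 0) {        /* first match: stop      */
               results[gid] = state;
               break;
           }
       }
   }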

4. Optimizations

For this prototype an Nvidia Tesla K10 GPU [5] is used, which has 2
processors with 1536 cores each, running at 745 MHz.  Each processor
has 4 GB of memory attached to it, and the card is connected to the
CPU via a PCIe 3.0 x16 interface.

The server used is a Dell R720, which has Intel Xeon 2665 processors
(2 processors with 16 cores each).  Only one CPU core is used for
our experiment.

4.1 Variable size packet packing

Multiple copies from CPU to GPU are costly, so packets are batched
for processing on the GPU.  Packet sizes vary from 64 bytes to 1500
bytes.  Having a fixed-size buffer for each packet leads to copying
a lot of unwanted memory from CPU to GPU in the case of smaller
packets.

For variable-size packing, one single large buffer is allocated for
the number of packets in the batch.  The initial portion of the
buffer holds the packet start offsets for all packets.  At each
packet offset, the packet size and the packet contents are placed.
Only the portion of the buffer filled with packets is copied from
CPU to GPU.

4.2 Pinned Memory

Memory allocated using malloc is paged memory.  When copying from
CPU to GPU, memory is first copied from paged memory to non-paged
memory, and then from non-paged memory to GPU global memory.

OpenCL provides commands and a procedure to allocate and copy
non-paged memory [3][4].  Using this pinned memory avoids one
internal copy and showed a 3x improvement in memory copy time.  In
our experiments, pinned memory was used for the CPU-to-GPU packet
buffer copy and the GPU-to-CPU result buffer copy.
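
The host-side sketch below combines the variable-size packing of
Section 4.1 with the pinned-memory allocation of Section 4.2.  The
record layout, the 4-byte alignment of each record, the constants
and the function names are illustrative assumptions;
CL_MEM_ALLOC_HOST_PTR, clEnqueueMapBuffer and
clEnqueueUnmapMemObject are the standard OpenCL mechanisms referred
to in [3][4].

   #include <CL/cl.h>
   #include <string.h>

   #define BATCH_PKTS  16384
   #define BATCH_BYTES (BATCH_PKTS * (1514 + 2 * (int)sizeof(cl_int)))

   /* Allocate a batch buffer that the runtime can back with pinned
      (non-paged) host memory. */
   static cl_mem alloc_batch_buffer(cl_context ctx)
   {
       cl_int err;
       return clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, BATCH_BYTES,
                             NULL, &err);
   }

   /* Pack n variable-size packets: an offset table at the start of
      the buffer, then a 4-byte length followed by the packet bytes
      for each packet.  Returns the number of bytes actually used, so
      only that much is copied to the device. */
   static size_t pack_batch(cl_command_queue q, cl_mem batch,
                            unsigned char **pkts, cl_int *lens, int n)
   {
       cl_int err;
       unsigned char *host = (unsigned char *)
           clEnqueueMapBuffer(q, batch, CL_TRUE, CL_MAP_WRITE,
                              0, BATCH_BYTES, 0, NULL, NULL, &err);
       cl_int *offsets = (cl_int *)host;
       size_t  pos     = (size_t)n * sizeof(cl_int);

       for (int i = 0; i < n; i++) {
           offsets[i] = (cl_int)pos;
           memcpy(host + pos, &lens[i], sizeof(cl_int));
           memcpy(host + pos + sizeof(cl_int), pkts[i], (size_t)lens[i]);
           pos += sizeof(cl_int) + (size_t)lens[i];
           pos  = (pos + 3) & ~(size_t)3;  /* keep records 4-byte aligned */
       }
       clEnqueueUnmapMemObject(q, batch, host, 0, NULL, NULL);
       return pos;
   }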

4.3 Pipelined Scheduler

OpenCL supports multiple command queues, and Nvidia supports 32
command queues.  Using non-blocking calls, commands can be placed on
each queue.  While GPU kernel functions are being executed, memory
copies between CPU and GPU can happen in parallel.

In our experiment 6 command queues were created, 3 queues for each
GPU processor.  Copying packet buffers to the GPU, launching GPU
kernel functions and reading results from the GPU are executed in
parallel for 6 batches of data.  Scheduling is performed round robin
to maintain packet order.  Pipelining allows hiding 99% of the copy
time and utilizing the full processing power of the GPU.

4.4 Reduce Global memory access

The OpenCL architecture and the Nvidia GPU architecture have 3
levels of memory: global, local and private.  Packets from the CPU
are copied to GPU global memory.  Global memory access is costly,
and character-by-character access is not efficient.

Accessing private memory is faster, but private memory is small and
cannot hold a complete packet.  So packets are copied 32 bytes at a
time using vload8 with the float type.

4.5 Organizing GPU cores

The number of kernel functions launched (global size) should be more
than the number of GPU cores in order to hide latency.  The GPU
provides sub-grouping of cores to share memory.  The optimal
grouping size (local size) is calculated specifically for the GPU
card.

5. Results

Using the Aho-Corasick algorithm, the performance of the GPU system
was measured with different parameters.  The signature database
consists of the top website names.  Ethernet and IP headers, which
amount to 34 bytes for each packet, are skipped in the search.
Protocol-header-only or application-header-only analysis can also be
performed.

The Aho-Corasick algorithm is modified to match any one string from
the signature database.  After the first string is matched, the
result is written to the result buffer and the function exits.  If
none of the strings match, the whole packet is searched; this is the
worst-case behavior.  If a string matches earlier, the remainder of
the packet is not searched.

To understand the performance and keep track of how the commands
execute, OpenCL provides the function clGetEventProfilingInfo, which
queries a cl_event for counter values.  The device time counters are
returned in nanoseconds.
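
As a non-normative illustration, the sketch below times a single
kernel launch with these counters.  The helper name and the
assumption that the context, device, kernel and work sizes already
exist are hypothetical; the profiling calls themselves are standard
OpenCL [3].

   #include <CL/cl.h>

   /* Launch 'kernel' once and return its device execution time in
      milliseconds.  The queue must be created with
      CL_QUEUE_PROFILING_ENABLE for the counters to be valid. */
   static double profile_kernel(cl_context ctx, cl_device_id dev,
                                cl_kernel kernel,
                                size_t global, size_t local)
   {
       cl_int err;
       cl_command_queue q =
           clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
       cl_event ev;
       clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, &local,
                              0, NULL, &ev);
       clWaitForEvents(1, &ev);

       cl_ulong start = 0, end = 0;
       clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                               sizeof(start), &start, NULL);
       clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                               sizeof(end), &end, NULL);
       clReleaseEvent(ev);
       clReleaseCommandQueue(q);
       return (double)(end - start) / 1.0e6;  /* counters are in ns */
   }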

For these experiments the Nvidia Tesla K10 GPU and Dell R720 server
described above are used.

Results were taken by executing on a bare-metal server running
Linux.  The same code can also be executed inside a virtual machine.

5.1 Worst case 0 string match

Performance was measured with signature database sizes of 1000, 2500
and 5000 strings.  Fixed-size packets were generated with packet
sizes of 64, 583 and 1450 bytes.  Variable-size packets were
generated with packet sizes from 64 to 1500 bytes and an average
packet size of 583 bytes.  The results are shown in Table 1 and
Table 2.

   +--------------+----------+----------+-----------+------------+
   |No of Strings | 64 Fixed | 583 Fixed| 1450 Fixed|583 Variable|
   +--------------+----------+----------+-----------+------------+
   |     1000     |  37.03   |  30.74   |   31.68   |   15.08    |
   |     2500     |  37.03   |  30.17   |   31.15   |   14.94    |
   |     5000     |  36.75   |  30.03   |   31.15   |   14.87    |
   +--------------+----------+----------+-----------+------------+

    Table 1: Bandwidth in Gbps for different packet sizes of traffic

   +--------------+----------+----------+-----------+------------+
   |No of Strings | 64 Fixed | 583 Fixed| 1450 Fixed|583 Variable|
   +--------------+----------+----------+-----------+------------+
   |     1000     |  77.67   |   7.07   |    2.93   |    3.47    |
   |     2500     |  77.41   |   6.95   |    2.88   |    3.44    |
   |     5000     |  77.07   |   6.91   |    2.88   |    3.42    |
   +--------------+----------+----------+-----------+------------+

    Table 2: Number of packets in million packets per second (mpps)
             for different packet sizes of traffic

Varying the signature database size between 1000, 2500 and 5000
strings does not have any major impact.  The state machine grows
with the signature database size, but the processing time for the
packets remains the same.

For fixed-size packets, the total bandwidth processed was always
above 30 Gbps.  For variable-size packets, the total bandwidth
processed is 14.9 Gbps.

Variable packet sizes range from 64 to 1500 bytes, with each packet
assigned to one core.  A core that finishes early is idle until the
other cores complete their work, so the full GPU power is not
effectively utilized when using variable-length packets.

5.2 Packet match

With the match percentage as the key, different parameters are
measured.  Table 3 shows the match percentage against the bandwidth
in Gbps.  For this experiment, variable-size packets with an average
of 583 bytes are used.  16384 packets are batched for processing on
the GPU and 16384 threads are instantiated.  Each packet is checked
against 5000 strings.

   +--------------+-----------+
   |% of packets  | Bandwidth |
   |   matched    |  in Gbps  |
   +--------------+-----------+
   |       0      |   14.87   |
   |      15      |   18.50   |
   |      25      |   20.85   |
   |      35      |   33.02   |
   +--------------+-----------+

    Table 3: Bandwidth in Gbps for different packet match percentages

   +--------------+--------------+
   |% of packets  | No of Packets|
   |   matched    |    in mpps   |
   +--------------+--------------+
   |       0      |     3.42     |
   |      15      |     4.25     |
   |      25      |     4.80     |
   |      35      |     7.60     |
   +--------------+--------------+

     Table 4: Number of packets in mpps for different packet match
              percentages

The packet match percentage against the number of packets processed
in mpps is shown in Table 4.  The worst-case experiment is 0 packets
matched, so the whole packet needs to be searched.  The time for a
single buffer (16384 packets) copy from CPU to GPU is 0.903
milliseconds.  The kernel execution time for a single buffer is
9.784 milliseconds.  The result buffer copy from GPU to CPU is 0.161
milliseconds.  A total of 209 buffers are processed in one second,
which is 3.42 million packets and 14.9 Gbps.

The best-case experiment was executed with 35% of packets matching.
The time for a single buffer (16384 packets) copy from CPU to GPU is
0.923 milliseconds.  The kernel execution time for a single buffer
is 4.38 milliseconds.  The result buffer copy from GPU to CPU is
0.168 milliseconds.  A total of 464 buffers are processed in one
second, which is 7.6 million packets and 33.02 Gbps.

6. Compute Acceleration API

Multiple compute accelerators like GPUs, coprocessors, ASICs/FPGAs
and multicore CPUs can be used for NFV.  A common API for NFV
Compute Acceleration (NCA) can abstract the hardware details and
enable NFV applications to use compute acceleration.  The API will
be a C library that users can compile along with their code.

The delivery of end-to-end services often requires various network
functions.  Compute acceleration APIs should support the definition
of an ordered set of network functions and of subsets of these
network functions which can be processed in parallel.

6.1 Add Network Function

Multiple network functions can be defined in the system.  Network
functions are identified by a network function id.  Based on the
service chain requirement, network functions are dynamically loaded
to the cores and executed.  The API function
nca_add_network_function adds a new network function to the NCA.

In OpenCL terminology, a kernel is a function, or set of functions,
executed on a compute core.  OpenCL code files are small files
containing these kernel functions.

   int nca_add_network_function(
       int network_func_id,
       int (*network_func_init)(int network_func_id, void *init_params),
       char *cl_file_name,
       char *kernel_name,
       int (*set_kernel_arg)(int network_func_id, void *sf_args,
                             char *pkt_buf),
       int result_buffer_size
   )

   network_func_id    : Network function identifier, unique for every
                        network function in the framework
   network_func_init  : Initializes the network function; device
                        memory allocations are made and service-
                        specific data structures are created
   cl_file_name       : File with the network function kernel code
   kernel_name        : Network function kernel entry function name
   set_kernel_arg     : Function that sets up kernel arguments before
                        the kernel is called
   result_buffer_size : Result buffer size for this network function
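
A non-normative example of registering the multi-string match
function through this API is shown below.  The function id, file
name, kernel name, callback bodies and result buffer size are
hypothetical application code; only nca_add_network_function itself
is defined by this draft, and its prototype is assumed to come from
an NCA header.

   #define NF_AC_URL_FILTER 101          /* hypothetical function id */

   /* Build the goto/fail/out tables from the signature database and
      copy them to device memory (details omitted). */
   static int ac_init(int network_func_id, void *init_params)
   {
       return 0;
   }

   /* Bind the DFA tables and the packet buffer to kernel arguments
      before each launch. */
   static int ac_set_kernel_arg(int network_func_id, void *sf_args,
                                char *pkt_buf)
   {
       return 0;
   }

   int register_ac_function(void)
   {
       return nca_add_network_function(NF_AC_URL_FILTER,
                                       ac_init,
                                       "ac_match.cl",   /* kernel file */
                                       "ac_match",      /* entry point */
                                       ac_set_kernel_arg,
                                       16384 * (int)sizeof(int));
   }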

6.2 Add Traffic Stream

Traffic streams are identified by a stream id.  A traffic stream is
initialized with the number of buffers and the size of each buffer
allocated for the stream.  Each buffer is identified by a buffer id
and can hold N packets.  The buffers are treated like ring buffers;
they are allocated as contiguous memory by NCA and a pointer is
returned.

Any notification during buffer processing is given through the
callback function with stream_id, buffer_id and event.

A traffic stream is associated with a service function chain.

The service function chain is defined by three parameters.  The
number of network functions is given in num_network_funcs.  The
actual network function ids are in the service_func_chain array.
The network functions are divided into subsets, and each subset has
a subset number.  Network functions within a subset can be executed
in parallel; subsets must be executed in sequence.  There is a
special subset number 0, which can be executed independently of any
other network functions in the chain.  For example, six network
functions are represented below.

   num_network_funcs = 6;
   service_func_chain         = {101, 105, 107, 108, 109, 110}
   network_func_parallel_sets = {  1,   1,   1,   2,   2,   0}

In the above example, subset 1, which contains 101, 105 and 107,
should be executed first; within this subset all three can be
executed in parallel.  After subset 1, subset 2, which contains 108
and 109, is executed.  Subset 0 does not have any dependencies; the
scheduler can execute it at any time.

   typedef struct dev_params_s {
       int dev_type;
       int num_devices;
   } nca_dev;

   int nca_traffic_stream_init(
       int num_buffers,
       int buffer_size,
       int (*notify_callback)(int stream_id, int buffer_id, int event),
       int num_network_funcs,
       int service_func_chain[CAF_MAX_SF],
       int network_func_parallel_sets[CAF_MAX_SF],
       nca_dev dev_params
   )

   num_buffers                : Number of buffers
   buffer_size                : Size of each buffer
   notify_callback            : Event notification callback
   num_network_funcs          : Number of network functions in the
                                service chain
   service_func_chain         : Network function ids in this service
                                chain
   network_func_parallel_sets : Subsets for sequential and parallel
                                ordering of the network functions
   dev_params                 : The device this traffic stream should
                                be processed on
   Return Value               : stream-id, which uniquely identifies
                                the traffic stream
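
The sketch below shows how the example chain above could be passed
to nca_traffic_stream_init.  The callback body, the device type
value and the buffer parameters are illustrative assumptions;
CAF_MAX_SF and the nca_*() prototypes are assumed to come from the
NCA header.

   static int stream_event(int stream_id, int buffer_id, int event)
   {
       /* e.g. schedule a result read when the chain completes */
       return 0;
   }

   int open_url_filter_stream(void)
   {
       int chain[CAF_MAX_SF]   = {101, 105, 107, 108, 109, 110};
       int subsets[CAF_MAX_SF] = {  1,   1,   1,   2,   2,   0};
       nca_dev dev = { 0 /* device type, e.g. GPU */, 2 /* devices */ };

       /* returns the stream-id used by all later per-buffer calls */
       return nca_traffic_stream_init(6,      /* num_buffers         */
                                      16384,  /* size of each buffer */
                                      stream_event,
                                      6, chain, subsets, dev);
   }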

6.3 Add Packets to Buffer

Packets are added to the buffer directly by the client application
or by calling nca_add_packets.  One or more packets can be added to
the buffer.

   int nca_add_packets(
       int stream_id,
       int buffer_id,
       char *packet,
       int packet_len[],
       int num_packets
   )

   stream_id    : Stream id of the traffic stream
   buffer_id    : Identifies the buffer to which the packets are added
   packet       : Packet contents
   packet_len[] : Length of each packet
   num_packets  : Number of packets

6.4 Process Packets

Once packets have been filled into the buffer, nca_buffer_ready is
called to process the buffer.  This function can also be called
without filling the complete buffer.  The NCA scheduler marks this
buffer for processing.

   int nca_buffer_ready(
       int stream_id,
       int buffer_id
   )

   stream_id : Stream id identifying the traffic stream
   buffer_id : Identifies the buffer to be processed

6.5 Event Notification

NCA notifies events about a buffer using the registered callback
function.  After the buffer has been processed by the registered
service functions, the notify callback is called and the client can
read the result buffer.

   int (*notify_callback)(
       int stream_id,
       int buffer_id,
       int event
   )

   stream_id : Stream id identifying the traffic stream
   buffer_id : Identifies the buffer the event refers to
   event     : Maps to one of the buffer events.  If the event is not
               specific to a buffer, the buffer id is 0

6.6 Read Results

The client can read the results after service chain processing.
Completion of service chain processing is notified by an event
through the callback function.

   int caf_read_results(
       int stream_id,
       int buffer_id,
       char *result_buffer
   )

   stream_id     : Stream id identifying the traffic stream
   buffer_id     : Identifies the buffer whose results are read
   result_buffer : Pointer to which the result buffer is copied
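
The following non-normative sketch ties Sections 6.3 through 6.6
together for one buffer: packets are added, the buffer is handed to
the scheduler, and results are read from the notification callback.
next_rx_packet(), deliver_verdict(), the event code tested in the
callback and the result buffer size are hypothetical application-
side details; the nca_*() and caf_*() prototypes are assumed to come
from the NCA header.

   #define PKTS_PER_BUFFER 16384
   #define EV_CHAIN_DONE   1              /* hypothetical event code */

   extern char *next_rx_packet(int *len);        /* packet I/O layer  */
   extern void  deliver_verdict(const char *r);  /* application logic */

   static int  g_stream_id;
   static char g_results[PKTS_PER_BUFFER * sizeof(int)];

   static int on_stream_event(int stream_id, int buffer_id, int event)
   {
       if (event == EV_CHAIN_DONE) {
           caf_read_results(stream_id, buffer_id, g_results);
           deliver_verdict(g_results);
       }
       return 0;
   }

   static void submit_one_buffer(int buffer_id)
   {
       for (int i = 0; i < PKTS_PER_BUFFER; i++) {
           int len;
           char *pkt = next_rx_packet(&len);
           nca_add_packets(g_stream_id, buffer_id, pkt, &len, 1);
       }
       nca_buffer_ready(g_stream_id, buffer_id);  /* scheduler takes over */
   }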

7. Other Accelerators

The prototype multi-string search written in OpenCL compiled and
executed successfully on both an Intel Xeon Phi coprocessor and a
CPU-only system with minimal changes to the makefile.  For CPU-only
systems the memory copies can be avoided.  Since the optimizations
for these platforms have not been carried out, the performance
numbers are not published.

8. Conclusion

To get the best performance out of GPUs with a large number of
cores, the number of threads executed in parallel should be large.
For a single network function the latency will be in milliseconds,
so this approach is suited for network monitoring functions.  If
GPUs are tasked with multiple network functions in parallel, they
can be used for other NFV functions as well.

Assigning a single core to each packet gives the best performance
when all packet sizes are equal.  For variable-length packets the
performance goes down, because a core processing a smaller packet is
idle until the other cores complete processing the larger packets.

Code written in OpenCL is easily portable to other platforms like
the Intel Xeon Phi and multicore CPUs with just makefile changes.
Though the same code executes correctly on all platforms,
platform-specific optimizations need to be done to achieve good
performance.

This draft proposes a network compute acceleration framework that
contains the hardware-specific optimizations and exposes high-level
APIs to applications: a set of APIs for defining traffic streams,
adding network functions, and declaring a service chain whose
ordering can combine sequential and parallel execution.

9. Future Work

Dynamic device discovery and optimized code for different algorithms
and devices will make NCA a common platform on which to develop
applications.

Integration of compute acceleration with I/O acceleration
technologies like Intel DPDK [9] can provide a complete networking
platform for applications.

Further work includes verification and performance measurement of
the compute acceleration platform running inside virtual machines,
and running the platform inside Linux containers or Docker.

10 Security Considerations

Not Applicable

11 IANA Considerations

Not Applicable

12 References

12.1 Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

12.2 Informative References

   [1]   ETSI NFV White Paper, "Network Functions Virtualisation: An
         Introduction, Benefits, Enablers, Challenges, & Call for
         Action", http://portal.etsi.org/NFV/NFV_White_Paper.pdf
   [2]   A.V. Aho and M.J. Corasick, "Efficient string matching: An
         aid to bibliographic search", Communications of the ACM,
         vol. 18, no. 6, June 1975.
   [3]   OpenCL Specification,
         https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf
   [4]   OpenCL Best Practices Guide,
         http://www.nvidia.com/content/cudazone/CUDABrowser/
         downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf
   [5]   Nvidia Tesla K10,
         http://www.nvidia.in/content/PDF/kepler/Tesla-K10-Board-
         Spec-BD-06280-001-v07.pdf
   [6]   Intel Xeon Phi,
         http://www.intel.in/content/www/in/en/processors/xeon/
         xeon-phi-detail.html
   [7]   Intel OpenCL, https://software.intel.com/en-us/intel-opencl
   [8]   Implementing Regular Expressions,
         https://swtch.com/~rsc/regexp/
   [9]   Intel DPDK, http://dpdk.org/
   [10]  AMD OpenCL Zone,
         http://developer.amd.com/tools-and-sdks/opencl-zone/
   [11]  Nvidia OpenCL, https://developer.nvidia.com/opencl

Acknowledgements

The authors would like to thank the following individuals for their
support in verifying the prototype on different platforms: Shiva
Katta and K. Narendra.

Authors' Addresses

   Bose Perumal
   Dell
   Bose_Perumal@Dell.com

   Wenjing Chu
   Dell
   Wenjing_Chu@Dell.com

   Ram (Ramki) Krishnan
   Dell
   Ramki_Krishnan@Dell.com

   Hemalathaa S
   Dell
   Hemalathaa_S@Dell.com