NFVRG                                                      Bose Perumal
Internet-Draft                                              Wenjing Chu
Intended Status: Informational                              R. Krishnan
                                                           Hemalathaa S
                                                                    Dell
Expires: December 3, 2015                                  June 29, 2015


             NFV Compute Acceleration Evaluation and APIs
           draft-perumal-nfvrg-nfv-compute-acceleration-00

Abstract

Network functions are being virtualized and moved to industry
standard servers.  Steady growth in traffic volume requires more
compute power to process network functions.  Packet-based network
processing offers considerable scope for parallel processing.
Generic parallel processing can be carried out on common multicore
platforms such as GPUs, coprocessors such as the Intel Xeon Phi, and
Intel/AMD multicore CPUs.  To check the feasibility of exploiting
this parallel processing capability, this draft takes multi-string
matching for URL filtering as the sample network function.  The
Aho-Corasick algorithm is used for multi-pattern matching.  The
implementation uses OpenCL in order to support many common
platforms.  A set of optimizations is applied, and the application
is tested on Nvidia Tesla K10 GPUs.  A common API for NFV Compute
Acceleration is proposed.

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with
the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time.  It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

Copyright and License Notice

Copyright (c) 2015 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document.  Code Components extracted from this
document must include Simplified BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Simplified BSD License.

Table of Contents

   1  Introduction . . . . . . . . . . . . . . . . . . . . . . . .  4
     1.1  Terminology . . . . . . . . . . . . . . . . . . . . . .  5
   2. OpenCL based Virtual Network Function Architecture . . . . .  6
     2.1  CPU Process . . . . . . . . . . . . . . . . . . . . . .  7
     2.2  Device Discovery  . . . . . . . . . . . . . . . . . . .  7
     2.3  Mixed Version Support . . . . . . . . . . . . . . . . .  7
     2.4  Scheduler . . . . . . . . . . . . . . . . . . . . . . .  8
   3. Aho-Corasick Algorithm  . . . . . . . . . . . . . . . . . .  9
   4. Optimizations . . . . . . . . . . . . . . . . . . . . . . .  9
     4.1  Variable size packet packing  . . . . . . . . . . . . .  9
     4.2  Pinned Memory . . . . . . . . . . . . . . . . . . . . . 10
     4.3  Pipelined Scheduler . . . . . . . . . . . . . . . . . . 10
     4.4  Reduce Global memory access . . . . . . . . . . . . . . 10
     4.5  Organizing GPU cores  . . . . . . . . . . . . . . . . . 10
   5. Results . . . . . . . . . . . . . . . . . . . . . . . . . . 11
     5.1  Worst case 0 string match . . . . . . . . . . . . . . . 11
     5.2  Packet match  . . . . . . . . . . . . . . . . . . . . . 12
   6. Compute Acceleration API  . . . . . . . . . . . . . . . . . 13
     6.1  Add Network Function  . . . . . . . . . . . . . . . . . 13
     6.2  Add Traffic Stream  . . . . . . . . . . . . . . . . . . 14
     6.3  Add Packets to Buffer . . . . . . . . . . . . . . . . . 16
     6.4  Process Packets . . . . . . . . . . . . . . . . . . . . 16
     6.5  Event Notification  . . . . . . . . . . . . . . . . . . 16
     6.6  Read Results  . . . . . . . . . . . . . . . . . . . . . 17
   7. Other Accelerators  . . . . . . . . . . . . . . . . . . . . 17
   8. Conclusion  . . . . . . . . . . . . . . . . . . . . . . . . 17
   9. Future Work . . . . . . . . . . . . . . . . . . . . . . . . 18
   10 Security Considerations . . . . . . . . . . . . . . . . . . 18
   11 IANA Considerations . . . . . . . . . . . . . . . . . . . . 18
   12 References  . . . . . . . . . . . . . . . . . . . . . . . . 18
     12.1  Normative References . . . . . . . . . . . . . . . . . 18
     12.2  Informative References . . . . . . . . . . . . . . . . 18
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . 19
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . 19

1 Introduction

Network equipment vendors use specialized hardware to process data
at low latency and high throughput.  Packet processing above 4 Gb/s
is done using expensive, purpose-built application-specific
integrated circuits.  However, the low unit volumes force
manufacturers to price these devices at many times the cost of
producing them, to recover the R&D cost.

Network Function Virtualization (NFV) [1] is a key emerging area for
network operators, hardware and software vendors, cloud service
providers, and in general network practitioners and researchers.
NFV introduces virtualization technologies into the core network to
create a more intelligent, more agile service infrastructure.
Network functions that are traditionally implemented in dedicated
hardware appliances will need to be decomposed and executed in
virtual machines running in data centers.  The parallelism of
graphics processors gives them the potential to function as network
coprocessors.

A virtual network function is responsible for a specific treatment
of received packets and can act at various layers of a protocol
stack.  When more compute power is available, multiple virtual
network functions can be executed in a single system or VM.  When
multiple virtual network functions are processed in one system, some
of them could be processed in parallel with other network functions.
This draft proposes a method to represent an ordered set of virtual
network functions as a combination of sequential and parallel
stages.  This draft is about software-based network functions, so
any further reference to a network function means a virtual network
function.

Software written for specialized hardware like network processors,
ASICs and FPGAs is closely tied to the hardware and to specific
vendor products, and cannot be reused on other hardware platforms.
For generic compute acceleration, different hardware platforms can
be used: GPUs from different vendors, Intel Xeon Phi coprocessors,
and multicore CPUs from different vendors.  All of these compute
acceleration platforms support OpenCL as a parallel programming
language.  Instead of every vendor writing its own OpenCL code, this
draft proposes an NFV Compute Acceleration (NCA) API as a common
compute acceleration layer.  The API will be a library with C API
functions for declaring the network functions as an ordered set and
for moving packets.

Multi-pattern string matching is used in a number of applications,
including network intrusion detection and digital forensics; hence
multi-pattern matching is chosen as the sample network function.
The Aho-Corasick [2] algorithm, with a few modifications, is used to
find the first occurrence of any pattern from the signature
database.  Based on this network function, throughput numbers are
measured.

1.1 Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

2. OpenCL based Virtual Network Function Architecture

Network functions like multi-pattern matching are compute intensive
and common to multiple NFV applications.
In this prototype, generic compute acceleration framework functions
and service-specific functions are clearly separated.  The
architecture with one network function is shown in Figure 1;
multiple network functions can also be loaded.  Most signature-based
algorithms, such as Aho-Corasick [2] and regular expression matching
[8], generate a Deterministic Finite Automaton (DFA) [2][8].  The
DFA database is generated on the CPU and loaded to the accelerator.
Kernels executed on the accelerator use the DFA.

   +------------------------------------------+  +-------------------+
   |                CPU Process               |  | GPU/Xeon Phi,etc. |
   |                                          |  |                   |
   |  Scheduler                               |  |                   |
   |  +------------+   +------------+         |  |  +------------+   |
   |  | Packet     |   |Copy Packet |         |  |  |Input Buffer|   |
   |  |Generator   +-->|CPU to GPU  +-------------->|P1,P2,...,Pn|   |
   |  +------------+   +------------+         |  |  +-----+------+   |
   |                                          |  |        |          |
   |  +------------+                          |  |  +-----v------+   |
   |  |Launch GPU  |                          |  |  |GPU Kernels |   |
   |  |Kernels     +------------------------------->|K1,K2,...,Kn|<+ |
   |  +------------+                          |  |  +-----+------+ | |
   |                                          |  |        |        | |
   |  +-------------+   +------------+        |  |  +-----v------+ | |
   |  |Results for  |   |Copy Results|        |  |  |Result Buf  | | |
   |  |each packet  |<--+GPU to CPU  |<-------------+R1,R2,...,Rn| | |
   |  +-------------+   +------------+        |  |  +------------+ | |
   |                                          |  |                 | |
   |  +----------------+   +-----------+      |  |  +------------+ | |
   |  |Network Function|   | NF        |      |  |  | NF         | | |
   |  |(AC,Regex,etc)  +-->| Database  +----------->| Database   +-+ |
   |  +-------+--------+   +-----------+      |  |  +------------+   |
   |          ^                               |  |                   |
   +----------|-------------------------------+  +-------------------+
              |
        +-----+-----+
        | Signature |
        | Database  |
        +-----------+

          Figure 1. OpenCL based Virtual Network Function
                    Software Architecture Diagram

2.1 CPU Process

Accelerators like GPUs or coprocessors augment the CPU; currently
they cannot function alone.  A virtual network function is therefore
split between the CPU and the GPU.  The CPU process owns packet
preprocessing, packet movement and scheduling, while the GPU
performs the core functionality of the network functions.  The CPU
process interfaces between the packet I/O and the GPU.  During
initialization it performs the following steps.

   1. Device discovery
   2. Initialize the OpenCL object model
   3. Initialize the memory module
   4. Initialize network functions
   5. Trigger the scheduler

2.2 Device Discovery

Using OpenCL functions, the device discovery module discovers the
platforms and devices.  Based on the number of devices discovered,
device contexts and command queues are created.

2.3 Mixed Version Support

OpenCL is designed to support devices with different capabilities
under a single platform [3].  There are three version identifiers in
OpenCL: the platform version, the version of a device, and the
version(s) of the OpenCL C language supported on a device.

The platform version indicates the version of the OpenCL runtime
supported.  The device version is an indication of the device's
capabilities.  The language version for a device represents the
OpenCL programming language features a developer can assume are
supported on a given device.

OpenCL C is designed to be backwards compatible, so a device is not
required to support more than a single language version to be
considered conformant.  If multiple language versions are supported,
the compiler defaults to the highest language version supported for
the device.

Code written for an older device version may not utilize the full
capabilities of a newer device if there are hardware architectural
changes.
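
The following non-normative sketch illustrates the discovery and
version query steps of Sections 2.2 and 2.3 using standard OpenCL
1.2 host calls [3].  The function name, array sizes, use of printf
and omission of error handling are simplifications; in the prototype
the contexts and command queues are handed to the scheduler
(Section 2.4).

   #include <CL/cl.h>
   #include <stdio.h>

   void nca_discover_devices(void)
   {
       cl_platform_id   plat[4];
       cl_device_id     dev[16];
       cl_uint          n_plat = 0, n_dev = 0;
       char             ver[128];
       cl_int           err;

       clGetPlatformIDs(4, plat, &n_plat);
       for (cl_uint p = 0; p < n_plat; p++) {
           clGetPlatformInfo(plat[p], CL_PLATFORM_VERSION,
                             sizeof(ver), ver, NULL);
           printf("platform %u: %s\n", p, ver);

           clGetDeviceIDs(plat[p], CL_DEVICE_TYPE_ALL, 16, dev, &n_dev);
           for (cl_uint d = 0; d < n_dev; d++) {
               /* the language version decides which kernel features
                  can be assumed on this device (Section 2.3) */
               clGetDeviceInfo(dev[d], CL_DEVICE_OPENCL_C_VERSION,
                               sizeof(ver), ver, NULL);
               printf("  device %u OpenCL C: %s\n", d, ver);

               /* one context and command queue per device; in the
                  prototype these are kept by the scheduler */
               cl_context ctx = clCreateContext(NULL, 1, &dev[d],
                                                NULL, NULL, &err);
               cl_command_queue q = clCreateCommandQueue(ctx, dev[d],
                                                         0, &err);
               (void)q;
           }
       }
   }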

2.4 Scheduler

Scheduling between the packet buffers coming from the network I/O
and the device command queues is carried out by the scheduler.  The
scheduler operates on the following parameters.

   N - Number of packet buffers (default 6)
   M - Number of packets in each buffer (default 16384)
   K - Number of devices (2 discovered)
   J - Number of command queues for each device (default 3)
   I - Number of commands to the device to complete a single
       network function (default 3)
   S - Number of network functions executed in parallel (default 1)

The default values mentioned above gave the best results in our
current hardware environment with the multi-string match function.

Operations for completing a network function for one packet buffer:

   1. Identify a free command queue
   2. Copy packets from I/O memory to pinned memory for the GPU
   3. Fill kernel function parameters
   4. Copy pinned memory to GPU global memory
   5. Launch kernels for the number of packets in the packet buffer
   6. Check kernel execution completion and collect results
   7. Report results to the application

The scheduler calls the OpenCL API with the number of kernels to be
executed in parallel.  Distributing the kernels to cores is taken
care of by the OpenCL library.  If there is any error while
launching the kernels, the OpenCL API returns an error and
appropriate error handling can be done.

3. Aho-Corasick Algorithm

The Aho-Corasick algorithm [2] is one of the most effective
multi-pattern matching algorithms.  It is a dictionary-matching
algorithm that locates elements of a finite set of strings within an
input text.  The complexity of the algorithm is linear in the length
of the patterns plus the length of the searched text plus the number
of output matches.

The algorithm works in two parts.  The first part builds a tree
(state machine) from the keywords to be searched for, and the second
part searches the text for the keywords using the previously built
state machine.  Searching for a keyword is efficient because it only
moves through the states of the state machine.  If a character
matches, the goto() function is followed; otherwise the fail()
function is followed.  A match is reported by the out() function.

All three functions simply access indexed data structures and return
a value.  The goto() data structure is a two-dimensional matrix
indexed by the current state and the character currently being
compared.  The fail() function is an array that links each state to
its alternate path.  The out() function is an array over states that
records whether a string search completes in a particular state.

Based on the signature database, all three data structures are
constructed on the CPU.  They are copied to GPU global memory during
the initialization stage, and pointers to them are passed as kernel
parameters when the kernels are launched.
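
A non-normative OpenCL C sketch of the search kernel described above
is shown below.  The flattened table layout (goto indexed by
state*256+character, -1 for a missing transition), the result
encoding and the packet layout of Section 4.1 are illustrative
assumptions; the draft does not mandate a particular representation.
For clarity the sketch reads the packet byte by byte, whereas
Section 4.4 describes the vectorized access used in the prototype.

   /* One work-item searches one packet against the DFA tables.
      Assumed layout:
        goto_tbl[state * 256 + ch] : next state, or -1 if none
        fail_tbl[state]            : alternate path for the state
        out_tbl[state]             : non-zero if a pattern ends here
      Packets use the Section 4.1 layout: pkt_offsets[i] points at a
      4-byte length followed by the packet bytes (4-byte aligned). */
   __kernel void ac_match(__global const uchar *pkt_buf,
                          __global const int   *pkt_offsets,
                          __global const int   *goto_tbl,
                          __global const int   *fail_tbl,
                          __global const int   *out_tbl,
                          __global int         *results)
   {
       int gid = get_global_id(0);
       int off = pkt_offsets[gid];
       int len = *((__global const int *)(pkt_buf + off));
       __global const uchar *data = pkt_buf + off + 4;

       int state = 0;
       results[gid] = -1;                    /* -1: no pattern matched */

       for (int i = 34; i < len; i++) {      /* skip Ethernet+IP hdrs  */
           uchar c = data[i];
           while (state != 0 && goto_tbl[state * 256 + c] == -1)
               state = fail_tbl[state];      /* follow fail()          */
           int next = goto_tbl[state * 256 + c];
           if (next != -1)
               state = next;
           if (out_tbl[state] != 0) {        /* first match: stop      */
               results[gid] = state;
               break;
           }
       }
   }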

4. Optimizations

For this prototype an Nvidia Tesla K10 GPU [5] is used, which has 2
processors with 1536 cores each, running at 745 MHz.  Each processor
has 4 GB of memory attached to it, and the card is connected to the
CPU via a PCIe 3.0 x16 interface.

The server used is a Dell R720, which has Intel Xeon 2665 processors
(2 processors with 16 cores each).  Only one CPU core is used for
our experiment.

4.1 Variable size packet packing

Multiple copies from CPU to GPU are costly, so packets are batched
for processing on the GPU.  Packet sizes vary from 64 bytes to 1500
bytes.  Having a fixed-size buffer for each packet leads to copying
a lot of unwanted memory from CPU to GPU in the case of smaller
packets.

For variable-size packing, one single large buffer is allocated for
the number of packets in the batch.  The initial portion of the
buffer holds the packet start offsets for all packets.  At each
packet offset, the packet size and the packet contents are placed.
Only the portion of the buffer filled with packets is copied from
CPU to GPU.

4.2 Pinned Memory

Memory allocated using malloc is paged memory.  When copying from
CPU to GPU, memory is first copied from paged memory to non-paged
memory, and then from non-paged memory to GPU global memory.

OpenCL provides commands and a procedure to allocate and copy
non-paged memory [3][4].  Using this pinned memory avoids one
internal copy and showed a 3x improvement in memory copy time.  In
our experiments, pinned memory was used for the CPU-to-GPU packet
buffer copy and the GPU-to-CPU result buffer copy.
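
The host-side sketch below combines the variable-size packing of
Section 4.1 with the pinned-memory allocation of Section 4.2.  The
record layout, the 4-byte alignment of each record, the constants
and the function names are illustrative assumptions;
CL_MEM_ALLOC_HOST_PTR, clEnqueueMapBuffer and
clEnqueueUnmapMemObject are the standard OpenCL mechanisms referred
to in [3][4].

   #include <CL/cl.h>
   #include <string.h>

   #define BATCH_PKTS  16384
   #define BATCH_BYTES (BATCH_PKTS * (1514 + 2 * (int)sizeof(cl_int)))

   /* Allocate a batch buffer that the runtime can back with pinned
      (non-paged) host memory. */
   static cl_mem alloc_batch_buffer(cl_context ctx)
   {
       cl_int err;
       return clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, BATCH_BYTES,
                             NULL, &err);
   }

   /* Pack n variable-size packets: an offset table at the start of
      the buffer, then a 4-byte length followed by the packet bytes
      for each packet.  Returns the number of bytes actually used, so
      only that much is copied to the device. */
   static size_t pack_batch(cl_command_queue q, cl_mem batch,
                            unsigned char **pkts, cl_int *lens, int n)
   {
       cl_int err;
       unsigned char *host = (unsigned char *)
           clEnqueueMapBuffer(q, batch, CL_TRUE, CL_MAP_WRITE,
                              0, BATCH_BYTES, 0, NULL, NULL, &err);
       cl_int *offsets = (cl_int *)host;
       size_t  pos     = (size_t)n * sizeof(cl_int);

       for (int i = 0; i < n; i++) {
           offsets[i] = (cl_int)pos;
           memcpy(host + pos, &lens[i], sizeof(cl_int));
           memcpy(host + pos + sizeof(cl_int), pkts[i], (size_t)lens[i]);
           pos += sizeof(cl_int) + (size_t)lens[i];
           pos  = (pos + 3) & ~(size_t)3;  /* keep records 4-byte aligned */
       }
       clEnqueueUnmapMemObject(q, batch, host, 0, NULL, NULL);
       return pos;
   }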

4.3 Pipelined Scheduler

OpenCL supports multiple command queues, and Nvidia supports 32
command queues.  Using non-blocking calls, commands can be placed on
each queue.  While GPU kernel functions are being executed, memory
copies between CPU and GPU can happen in parallel.

In our experiment 6 command queues were created, 3 queues for each
GPU processor.  Copying packet buffers to the GPU, launching GPU
kernel functions and reading results from the GPU are executed in
parallel for 6 batches of data.  Scheduling is performed round robin
to maintain packet order.  Pipelining allows hiding 99% of the copy
time and utilizing the full processing power of the GPU.

4.4 Reduce Global memory access

The OpenCL architecture and the Nvidia GPU architecture have 3
levels of memory: global, local and private.  Packets from the CPU
are copied to GPU global memory.  Global memory access is costly,
and character-by-character access is not efficient.

Accessing private memory is faster, but private memory is small and
cannot hold a complete packet.  So packets are copied 32 bytes at a
time using vload8 with the float type.

4.5 Organizing GPU cores

The number of kernel functions launched (global size) should be more
than the number of GPU cores in order to hide latency.  The GPU
provides sub-grouping of cores to share memory.  The optimal
grouping size (local size) is calculated specifically for the GPU
card.

5. Results

Using the Aho-Corasick algorithm, the performance of the GPU system
was measured with different parameters.  The signature database
consists of the top website names.  Ethernet and IP headers, which
amount to 34 bytes for each packet, are skipped in the search.
Protocol-header-only or application-header-only analysis can also be
performed.

The Aho-Corasick algorithm is modified to match any one string from
the signature database.  After the first string is matched, the
result is written to the result buffer and the function exits.  If
none of the strings match, the whole packet is searched; this is the
worst-case behavior.  If a string matches earlier, the remainder of
the packet is not searched.

To understand the performance and keep track of how the commands
execute, OpenCL provides the function clGetEventProfilingInfo, which
queries a cl_event for counter values.  The device time counters are
returned in nanoseconds.
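
As a non-normative illustration, the sketch below times a single
kernel launch with these counters.  The helper name and the
assumption that the context, device, kernel and work sizes already
exist are hypothetical; the profiling calls themselves are standard
OpenCL [3].

   #include <CL/cl.h>

   /* Launch 'kernel' once and return its device execution time in
      milliseconds.  The queue must be created with
      CL_QUEUE_PROFILING_ENABLE for the counters to be valid. */
   static double profile_kernel(cl_context ctx, cl_device_id dev,
                                cl_kernel kernel,
                                size_t global, size_t local)
   {
       cl_int err;
       cl_command_queue q =
           clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
       cl_event ev;
       clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, &local,
                              0, NULL, &ev);
       clWaitForEvents(1, &ev);

       cl_ulong start = 0, end = 0;
       clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                               sizeof(start), &start, NULL);
       clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                               sizeof(end), &end, NULL);
       clReleaseEvent(ev);
       clReleaseCommandQueue(q);
       return (double)(end - start) / 1.0e6;  /* counters are in ns */
   }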

For these experiments the Nvidia Tesla K10 GPU and Dell R720 server
described above are used.

Results were taken by executing on a bare-metal server running
Linux.  The same code can also be executed inside a virtual machine.

5.1 Worst case 0 string match

Performance was measured with signature database sizes of 1000, 2500
and 5000 strings.  Fixed-size packets were generated with packet
sizes of 64, 583 and 1450 bytes.  Variable-size packets were
generated with packet sizes from 64 to 1500 bytes and an average
packet size of 583 bytes.  The results are shown in Table 1 and
Table 2.

   +--------------+----------+----------+-----------+------------+
   |No of Strings | 64 Fixed | 583 Fixed| 1450 Fixed|583 Variable|
   +--------------+----------+----------+-----------+------------+
   |     1000     |  37.03   |  30.74   |   31.68   |   15.08    |
   |     2500     |  37.03   |  30.17   |   31.15   |   14.94    |
   |     5000     |  36.75   |  30.03   |   31.15   |   14.87    |
   +--------------+----------+----------+-----------+------------+

    Table 1: Bandwidth in Gbps for different packet sizes of traffic

   +--------------+----------+----------+-----------+------------+
   |No of Strings | 64 Fixed | 583 Fixed| 1450 Fixed|583 Variable|
   +--------------+----------+----------+-----------+------------+
   |     1000     |  77.67   |   7.07   |    2.93   |    3.47    |
   |     2500     |  77.41   |   6.95   |    2.88   |    3.44    |
   |     5000     |  77.07   |   6.91   |    2.88   |    3.42    |
   +--------------+----------+----------+-----------+------------+

    Table 2: Number of packets in million packets per second (mpps)
             for different packet sizes of traffic

Varying the signature database size between 1000, 2500 and 5000
strings does not have any major impact.  The state machine grows
with the signature database size, but the processing time for the
packets remains the same.

For fixed-size packets, the total bandwidth processed was always
above 30 Gbps.  For variable-size packets, the total bandwidth
processed is 14.9 Gbps.

Variable packet sizes range from 64 to 1500 bytes, with each packet
assigned to one core.  A core that finishes early is idle until the
other cores complete their work, so the full GPU power is not
effectively utilized when using variable-length packets.

5.2 Packet match

With the match percentage as the key, different parameters are
measured.  Table 3 shows the match percentage against the bandwidth
in Gbps.  For this experiment, variable-size packets with an average
of 583 bytes are used.  16384 packets are batched for processing on
the GPU and 16384 threads are instantiated.  Each packet is checked
against 5000 strings.

   +--------------+-----------+
   |% of packets  | Bandwidth |
   |   matched    |  in Gbps  |
   +--------------+-----------+
   |       0      |   14.87   |
   |      15      |   18.50   |
   |      25      |   20.85   |
   |      35      |   33.02   |
   +--------------+-----------+

    Table 3: Bandwidth in Gbps for different packet match percentages

   +--------------+--------------+
   |% of packets  | No of Packets|
   |   matched    |    in mpps   |
   +--------------+--------------+
   |       0      |     3.42     |
   |      15      |     4.25     |
   |      25      |     4.80     |
   |      35      |     7.60     |
   +--------------+--------------+

     Table 4: Number of packets in mpps for different packet match
              percentages

The packet match percentage against the number of packets processed
in mpps is shown in Table 4.  The worst-case experiment is 0 packets
matched, so the whole packet needs to be searched.  The time for a
single buffer (16384 packets) copy from CPU to GPU is 0.903
milliseconds.  The kernel execution time for a single buffer is
9.784 milliseconds.  The result buffer copy from GPU to CPU is 0.161
milliseconds.  A total of 209 buffers are processed in one second,
which is 3.42 million packets and 14.9 Gbps.

The best-case experiment was executed with 35% of packets matching.
The time for a single buffer (16384 packets) copy from CPU to GPU is
0.923 milliseconds.  The kernel execution time for a single buffer
is 4.38 milliseconds.  The result buffer copy from GPU to CPU is
0.168 milliseconds.  A total of 464 buffers are processed in one
second, which is 7.6 million packets and 33.02 Gbps.

6. Compute Acceleration API

Multiple compute accelerators like GPUs, coprocessors, ASICs/FPGAs
and multicore CPUs can be used for NFV.  A common API for NFV
Compute Acceleration (NCA) can abstract the hardware details and
enable NFV applications to use compute acceleration.  The API will
be a C library that users can compile along with their code.

The delivery of end-to-end services often requires various network
functions.  Compute acceleration APIs should support the definition
of an ordered set of network functions and of subsets of these
network functions which can be processed in parallel.

6.1 Add Network Function

Multiple network functions can be defined in the system.  Network
functions are identified by a network function id.  Based on the
service chain requirement, network functions are dynamically loaded
to the cores and executed.  The API function
nca_add_network_function adds a new network function to the NCA.

In OpenCL terminology, a kernel is a function, or set of functions,
executed on a compute core.  OpenCL code files are small files
containing these kernel functions.

   int nca_add_network_function(
       int network_func_id,
       int (*network_func_init)(int network_func_id, void *init_params),
       char *cl_file_name,
       char *kernel_name,
       int (*set_kernel_arg)(int network_func_id, void *sf_args,
                             char *pkt_buf),
       int result_buffer_size
   )

   network_func_id    : Network function identifier, unique for every
                        network function in the framework
   network_func_init  : Initializes the network function; device
                        memory allocations are made and service-
                        specific data structures are created
   cl_file_name       : File with the network function kernel code
   kernel_name        : Network function kernel entry function name
   set_kernel_arg     : Function that sets up kernel arguments before
                        the kernel is called
   result_buffer_size : Result buffer size for this network function
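
A non-normative example of registering the multi-string match
function through this API is shown below.  The function id, file
name, kernel name, callback bodies and result buffer size are
hypothetical application code; only nca_add_network_function itself
is defined by this draft, and its prototype is assumed to come from
an NCA header.

   #define NF_AC_URL_FILTER 101          /* hypothetical function id */

   /* Build the goto/fail/out tables from the signature database and
      copy them to device memory (details omitted). */
   static int ac_init(int network_func_id, void *init_params)
   {
       return 0;
   }

   /* Bind the DFA tables and the packet buffer to kernel arguments
      before each launch. */
   static int ac_set_kernel_arg(int network_func_id, void *sf_args,
                                char *pkt_buf)
   {
       return 0;
   }

   int register_ac_function(void)
   {
       return nca_add_network_function(NF_AC_URL_FILTER,
                                       ac_init,
                                       "ac_match.cl",   /* kernel file */
                                       "ac_match",      /* entry point */
                                       ac_set_kernel_arg,
                                       16384 * (int)sizeof(int));
   }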

6.2 Add Traffic Stream

Traffic streams are identified by a stream id.  A traffic stream is
initialized with the number of buffers and the size of each buffer
allocated for the stream.  Each buffer is identified by a buffer id
and can hold N packets.  The buffers are treated like ring buffers;
they are allocated as contiguous memory by NCA and a pointer is
returned.

Any notification during buffer processing is given through the
callback function with stream_id, buffer_id and event.

A traffic stream is associated with a service function chain.

The service function chain is defined by three parameters.  The
number of network functions is given in num_network_funcs.  The
actual network function ids are in the service_func_chain array.
The network functions are divided into subsets, and each subset has
a subset number.  Network functions within a subset can be executed
in parallel; subsets must be executed in sequence.  There is a
special subset number 0, which can be executed independently of any
other network functions in the chain.  For example, six network
functions are represented below.

   num_network_funcs = 6;
   service_func_chain         = {101, 105, 107, 108, 109, 110}
   network_func_parallel_sets = {  1,   1,   1,   2,   2,   0}

In the above example, subset 1, which contains 101, 105 and 107,
should be executed first; within this subset all three can be
executed in parallel.  After subset 1, subset 2, which contains 108
and 109, is executed.  Subset 0 does not have any dependencies; the
scheduler can execute it at any time.

   typedef struct dev_params_s {
       int dev_type;
       int num_devices;
   } nca_dev;

   int nca_traffic_stream_init(
       int num_buffers,
       int buffer_size,
       int (*notify_callback)(int stream_id, int buffer_id, int event),
       int num_network_funcs,
       int service_func_chain[CAF_MAX_SF],
       int network_func_parallel_sets[CAF_MAX_SF],
       nca_dev dev_params
   )

   num_buffers                : Number of buffers
   buffer_size                : Size of each buffer
   notify_callback            : Event notification callback
   num_network_funcs          : Number of network functions in the
                                service chain
   service_func_chain         : Network function ids in this service
                                chain
   network_func_parallel_sets : Subsets for sequential and parallel
                                ordering of the network functions
   dev_params                 : The device this traffic stream should
                                be processed on
   Return Value               : stream-id, which uniquely identifies
                                the traffic stream
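
The sketch below shows how the example chain above could be passed
to nca_traffic_stream_init.  The callback body, the device type
value and the buffer parameters are illustrative assumptions;
CAF_MAX_SF and the nca_*() prototypes are assumed to come from the
NCA header.

   static int stream_event(int stream_id, int buffer_id, int event)
   {
       /* e.g. schedule a result read when the chain completes */
       return 0;
   }

   int open_url_filter_stream(void)
   {
       int chain[CAF_MAX_SF]   = {101, 105, 107, 108, 109, 110};
       int subsets[CAF_MAX_SF] = {  1,   1,   1,   2,   2,   0};
       nca_dev dev = { 0 /* device type, e.g. GPU */, 2 /* devices */ };

       /* returns the stream-id used by all later per-buffer calls */
       return nca_traffic_stream_init(6,      /* num_buffers         */
                                      16384,  /* size of each buffer */
                                      stream_event,
                                      6, chain, subsets, dev);
   }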

6.3 Add Packets to Buffer

Packets are added to the buffer directly by the client application
or by calling nca_add_packets.  One or more packets can be added to
the buffer.

   int nca_add_packets(
       int stream_id,
       int buffer_id,
       char *packet,
       int packet_len[],
       int num_packets
   )

   stream_id    : Stream id of the traffic stream
   buffer_id    : Identifies the buffer to which the packets are added
   packet       : Packet contents
   packet_len[] : Length of each packet
   num_packets  : Number of packets

6.4 Process Packets

Once packets have been filled into the buffer, nca_buffer_ready is
called to process the buffer.  This function can also be called
without filling the complete buffer.  The NCA scheduler marks this
buffer for processing.

   int nca_buffer_ready(
       int stream_id,
       int buffer_id
   )

   stream_id : Stream id identifying the traffic stream
   buffer_id : Identifies the buffer to be processed

6.5 Event Notification

NCA notifies events about a buffer using the registered callback
function.  After the buffer has been processed by the registered
service functions, the notify callback is called and the client can
read the result buffer.

   int (*notify_callback)(
       int stream_id,
       int buffer_id,
       int event
   )

   stream_id : Stream id identifying the traffic stream
   buffer_id : Identifies the buffer the event refers to
   event     : Maps to one of the buffer events.  If the event is not
               specific to a buffer, the buffer id is 0

6.6 Read Results

The client can read the results after service chain processing.
Completion of service chain processing is notified by an event
through the callback function.

   int caf_read_results(
       int stream_id,
       int buffer_id,
       char *result_buffer
   )

   stream_id     : Stream id identifying the traffic stream
   buffer_id     : Identifies the buffer whose results are read
   result_buffer : Pointer to which the result buffer is copied
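
The following non-normative sketch ties Sections 6.3 through 6.6
together for one buffer: packets are added, the buffer is handed to
the scheduler, and results are read from the notification callback.
next_rx_packet(), deliver_verdict(), the event code tested in the
callback and the result buffer size are hypothetical application-
side details; the nca_*() and caf_*() prototypes are assumed to come
from the NCA header.

   #define PKTS_PER_BUFFER 16384
   #define EV_CHAIN_DONE   1              /* hypothetical event code */

   extern char *next_rx_packet(int *len);        /* packet I/O layer  */
   extern void  deliver_verdict(const char *r);  /* application logic */

   static int  g_stream_id;
   static char g_results[PKTS_PER_BUFFER * sizeof(int)];

   static int on_stream_event(int stream_id, int buffer_id, int event)
   {
       if (event == EV_CHAIN_DONE) {
           caf_read_results(stream_id, buffer_id, g_results);
           deliver_verdict(g_results);
       }
       return 0;
   }

   static void submit_one_buffer(int buffer_id)
   {
       for (int i = 0; i < PKTS_PER_BUFFER; i++) {
           int len;
           char *pkt = next_rx_packet(&len);
           nca_add_packets(g_stream_id, buffer_id, pkt, &len, 1);
       }
       nca_buffer_ready(g_stream_id, buffer_id);  /* scheduler takes over */
   }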

7. Other Accelerators

The prototype multi-string search written in OpenCL compiled and
executed successfully on both an Intel Xeon Phi coprocessor and a
CPU-only system with minimal changes to the makefile.  For CPU-only
systems the memory copies can be avoided.  Since the optimizations
for these platforms have not been carried out, the performance
numbers are not published.

8. Conclusion

To get the best performance out of GPUs with a large number of
cores, the number of threads executed in parallel should be large.
For a single network function the latency will be in milliseconds,
so this approach is suited for network monitoring functions.  If
GPUs are tasked with multiple network functions in parallel, they
can be used for other NFV functions as well.

Assigning a single core to each packet gives the best performance
when all packet sizes are equal.  For variable-length packets the
performance goes down, because a core processing a smaller packet is
idle until the other cores complete processing the larger packets.

Code written in OpenCL is easily portable to other platforms like
the Intel Xeon Phi and multicore CPUs with just makefile changes.
Though the same code executes correctly on all platforms,
platform-specific optimizations need to be done to achieve good
performance.

This draft proposes a network compute acceleration framework that
contains the hardware-specific optimizations and exposes high-level
APIs to applications: a set of APIs for defining traffic streams,
adding network functions, and declaring a service chain whose
ordering can combine sequential and parallel execution.

9. Future Work

Dynamic device discovery and optimized code for different algorithms
and devices will make NCA a common platform on which to develop
applications.

Integration of compute acceleration with I/O acceleration
technologies like Intel DPDK [9] can provide a complete networking
platform for applications.

Further work includes verification and performance measurement of
the compute acceleration platform running inside virtual machines,
and running the platform inside Linux containers or Docker.

10 Security Considerations

Not Applicable

11 IANA Considerations

Not Applicable

12 References

12.1 Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

12.2 Informative References

   [1]   ETSI NFV White Paper, "Network Functions Virtualisation: An
         Introduction, Benefits, Enablers, Challenges, & Call for
         Action", http://portal.etsi.org/NFV/NFV_White_Paper.pdf
   [2]   A.V. Aho and M.J. Corasick, "Efficient string matching: An
         aid to bibliographic search", Communications of the ACM,
         vol. 18, no. 6, June 1975.
   [3]   OpenCL Specification,
         https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf
   [4]   OpenCL Best Practices Guide,
         http://www.nvidia.com/content/cudazone/CUDABrowser/
         downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf
   [5]   Nvidia Tesla K10,
         http://www.nvidia.in/content/PDF/kepler/Tesla-K10-Board-
         Spec-BD-06280-001-v07.pdf
   [6]   Intel Xeon Phi,
         http://www.intel.in/content/www/in/en/processors/xeon/
         xeon-phi-detail.html
   [7]   Intel OpenCL, https://software.intel.com/en-us/intel-opencl
   [8]   Implementing Regular Expressions,
         https://swtch.com/~rsc/regexp/
   [9]   Intel DPDK, http://dpdk.org/
   [10]  AMD OpenCL Zone,
         http://developer.amd.com/tools-and-sdks/opencl-zone/
   [11]  Nvidia OpenCL, https://developer.nvidia.com/opencl

Acknowledgements

The authors would like to thank the following individuals for their
support in verifying the prototype on different platforms: Shiva
Katta and K. Narendra.

Authors' Addresses

   Bose Perumal
   Dell
   Bose_Perumal@Dell.com

   Wenjing Chu
   Dell
   Wenjing_Chu@Dell.com

   Ram (Ramki) Krishnan
   Dell
   Ramki_Krishnan@Dell.com

   Hemalathaa S
   Dell
   Hemalathaa_S@Dell.com