
[dccp] shared buffer DCCP API



Hi, All,

   As Eddie mentioned, we have been designing a high-performance DCCP API,
and here it is :) The API is still tentative, so any suggestions or
comments are very welcome.

------------------------------------------------------------------------------

  The core of the API is a shared cyclic buffer between the kernel (as
consumer) and a user application (as producer). There are two pointers into
this buffer, u_p and k_p. u_p (basically a head pointer) is incremented each
time the app adds a packet to the buffer; k_p (basically a tail pointer) is
incremented each time the kernel takes a packet from the buffer and sends it.
  If the kernel and the app are perfectly paced, there is no synchronization
problem, since u_p is updated only by the app and k_p is updated only by the
kernel. In reality, though, the kernel may consume data faster than the app
produces it, or the app may produce data faster than the kernel consumes it.
In the former case, the kernel detects the empty buffer by comparing u_p and
k_p, sets a flag (k_notify), and stops. The app, after adding a packet to the
buffer, checks the flag and, if it is set, makes a system call, notify(), to
tell the kernel that there is new data. In the latter case, the buffer fills
up; the app can do something else and check back frequently, or it can call
poll(), which blocks until the app can write new data.
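
The pacing rules above can be sketched in a few lines of C. This is only a
single-process illustration, not the proposed kernel interface: the struct
layout, RING_SLOTS, sim_notify() and the choice to clear k_notify inside the
simulated notify() are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SLOTS 8            /* assumed capacity; one slot always stays empty */

struct ring {
    unsigned u_p;               /* head: written only by the app */
    unsigned k_p;               /* tail: written only by the kernel */
    bool     k_notify;          /* set by the kernel when it stops on an empty ring */
    int      notify_calls;      /* counts simulated notify() system calls */
    int      slot[RING_SLOTS];  /* stand-in for packet payloads */
};

static bool ring_empty(const struct ring *r) { return r->u_p == r->k_p; }
static bool ring_full(const struct ring *r)  { return (r->u_p + 1) % RING_SLOTS == r->k_p; }

/* Stand-in for the notify() system call: the kernel resumes sending.
 * Clearing the flag here is an assumption of this sketch. */
static void sim_notify(struct ring *r)
{
    r->k_notify = false;
    r->notify_calls++;
}

/* App side: add one packet. Returns false if the ring is full, in which
 * case the app would do other work or block in poll(). */
static bool app_add(struct ring *r, int pkt)
{
    assert(r->u_p < RING_SLOTS);
    if (ring_full(r))
        return false;
    r->slot[r->u_p] = pkt;
    r->u_p = (r->u_p + 1) % RING_SLOTS;
    if (r->k_notify)            /* the kernel is stopped: wake it up */
        sim_notify(r);
    return true;
}

/* Kernel side: take one packet. On an empty ring, set k_notify and stop. */
static bool kern_take(struct ring *r, int *pkt)
{
    if (ring_empty(r)) {
        r->k_notify = true;
        return false;
    }
    *pkt = r->slot[r->k_p];
    r->k_p = (r->k_p + 1) % RING_SLOTS;
    return true;
}
```

As long as the kernel never finds the ring empty, app_add() never reaches
sim_notify(), which is the case where no system call is needed at all.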
  When the connection experiences sudden network congestion and the
congestion window shrinks, the app has two choices. It can mark some packets
already added to the buffer as deleted or, if that is not enough, it can go
back and change packets that are already in the buffer. The detailed
procedure for the latter is:
  (1) Move u_p backward. A smart app should make sure that there are still a
      few packets between k_p and u_p, so that the kernel can keep sending
      data without being blocked.
  (2) After u_p is changed, the app checks k_p again (the kernel is always
      running) and changes only those packets that the kernel hasn't sent
      yet. Of course, we do not expect the kernel to catch up often, since
      we left some packets for it in (1).
  (3) After the changes, the app resets u_p to its original value and calls
      notify() if necessary.
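
A minimal sketch of steps (1) to (3), using array indices instead of
pointers. SLOTS, MARGIN, cyc_dist() and app_rewrite_from() are hypothetical
names chosen for the illustration, and the code is single-threaded: a real
implementation would need proper memory ordering between the app and the
kernel.

```c
#include <assert.h>

#define SLOTS  16   /* assumed ring capacity */
#define MARGIN 3    /* packets left for the kernel; the value is an assumption */

/* Number of slots from index a forward to index b, in cyclic space. */
static unsigned cyc_dist(unsigned a, unsigned b) { return (b + SLOTS - a) % SLOTS; }

struct ring {
    unsigned k_p;       /* tail, advanced by the kernel */
    unsigned u_p;       /* head, owned by the app */
    int      pkt[SLOTS];
};

/* Rewrite queued packets from index `from` up to the head with new_val.
 * Returns the number of packets actually rewritten. */
static unsigned app_rewrite_from(struct ring *r, unsigned from, int new_val)
{
    unsigned saved  = r->u_p;               /* original head, restored in step 3 */
    unsigned k      = r->k_p;
    unsigned queued = cyc_dist(k, saved);   /* packets the kernel has not taken */
    unsigned off    = cyc_dist(k, from);    /* where `from` sits in the queue */
    unsigned adv, n = 0;

    assert(from < SLOTS);
    if (off < MARGIN)
        off = MARGIN;                       /* leave a few packets for the kernel */
    if (off >= queued)
        return 0;                           /* nothing safely rewritable */

    r->u_p = (k + off) % SLOTS;             /* step 1: move the head backward */

    adv = cyc_dist(k, r->k_p);              /* step 2: re-check the tail */
    if (adv > off)
        off = adv;                          /* skip packets the kernel already took */
    for (unsigned i = off; i < queued; i++) {
        r->pkt[(k + i) % SLOTS] = new_val;
        n++;
    }

    r->u_p = saved;                         /* step 3: restore the head */
    return n;                               /* the caller would notify() if needed */
}
```

Working with offsets relative to k_p keeps every comparison in cyclic space,
so the same code handles a queue that wraps around the end of the array.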

  The buffer can also hold sequence numbers and ack information, so the app
doesn't need to ask the kernel explicitly for the ack info.

  This shared buffer API has several advantages.
  (1) low context switch overhead
    - A smart app is expected to generate data at roughly the same speed as
      the kernel consumes it. If that is the case, no context switch is
      incurred as a result of congestion window updates.
    - Note that with the conventional socket API, an application can also
      carefully pace its data generation based on feedback from the kernel,
      but it then needs to write() data in small chunks. On a busy server,
      those system calls can be expensive.
  (2) late decision as to what to send
    - If the app happens to generate too much data, it can cross out some
      packets or even go back and change some packets.
  (3) zero-copy
    - Since the buffer is shared, the kernel can temporarily pin it and does
      not need to copy the data before passing it to the device driver to
      send.
  (4) high throughput
    - This follows from (1) and (3).

If you want even more details, here they are :)
 
I.  The shared buffer.

   After calling socket(), the app registers a page-aligned buffer with the
   kernel by calling setsockopt(). The kernel maps the buffer into kernel
   address space. The simplest implementation pins the whole buffer, to
   avoid context switch overhead when the buffer needs to be accessed in
   interrupt or soft-interrupt mode. A better kernel implementation pins
   only part of the buffer, and only when absolutely necessary.
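
On the app side, getting a page-aligned buffer might look like the sketch
below. alloc_shared_buffer() is a hypothetical helper, and the socket option
shown in the trailing comment is invented for illustration; no option name
or level has been fixed for this API.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign() */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Allocate a zeroed, page-aligned region of npages pages.
 * Returns NULL on failure; on success *len_out holds the size in bytes. */
static void *alloc_shared_buffer(size_t npages, size_t *len_out)
{
    long   page = sysconf(_SC_PAGESIZE);
    void  *buf  = NULL;
    size_t len;

    assert(npages > 0);
    if (page <= 0)
        return NULL;
    len = npages * (size_t)page;
    if (posix_memalign(&buf, (size_t)page, len) != 0)
        return NULL;
    memset(buf, 0, len);          /* start with cleared pointers and flags */
    *len_out = len;
    return buf;
}

/*
 * Registration would then be a single call along these lines, where
 * HYPOTHETICAL_LEVEL and HYPOTHETICAL_SHARED_BUFFER are placeholders:
 *
 *     setsockopt(fd, HYPOTHETICAL_LEVEL, HYPOTHETICAL_SHARED_BUFFER, buf, len);
 */
```

Sizing the region in whole pages makes it straightforward for the kernel to
map and pin it page by page.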

II.  Layout of the buffer 

   The buffer consists of two parts. The first part is described below by
   the buffer_head structure; the second part is the real packet data.

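   struct buffer_head {
       struct packet  *kern_p_read;   /* read-only copy of kern_p */
       int             kern_flag;     /* flags from kernel to app */

       struct packet  *user_p;        /* pointer used by the app */
       struct packet  *user_p2;       /* used when moving user_p backward */
       int             user_flag;     /* flags from app to kernel */
       int             max_packets;   /* capacity, calculated by the app */

       struct packet   pkt[1];        /* packet descriptors */
   };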

   kern_p_read, read by the app, is a read-only copy of kern_p, the real
   private pointer used by the kernel. Each time kern_p changes, the kernel
   also updates kern_p_read, so kern_p_read reflects kern_p unless a
   malicious app overwrites kern_p_read, which only hurts the app itself.

   kern_flag stores flags that the kernel wants to pass to the app.
   Currently only KERN_NOTIFY is defined: if KERN_NOTIFY is set, the app is
   expected to call notify() after it adds a packet to the buffer.

   user_p is the pointer used by the app.

   user_flag stores flags that the app wants to pass to the kernel.
   Currently USER_WITH_HEADER is defined: if USER_WITH_HEADER is set, the
   app agrees to leave space for the basic IP + DCCP header in the packet
   space. Of course, the flag is honored only when the app has euid root.

   user_p2 is used mainly for moving the pointer backward: it keeps the
   original position while user_p is temporarily moved back.

   max_packets is calculated by the app; it stores the maximum number of
   packets allowed in the buffer.


   Since this buffer_head is shared by the kernel and the app, the kernel
   implementation has to be defensive so as to protect itself from malicious
   apps.


   The structure of a packet can be described as

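   struct packet {
       int    user_seq;    /* app's identifier for the packet */
       int    kern_flag;   /* per-packet flags from the kernel */
       int    user_flag;   /* per-packet flags from the app */
       char  *data_p;      /* points into the data part of the buffer */
   };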

   user_seq is used by the app to identify different packets.

   kern_flag stores per-packet flags from the kernel:
    KERN_GOING  means the kernel has taken over the packet and it is about
      to be sent by the device driver.
    KERN_GONE   means the packet has been sent by the device driver.
    KERN_ACKED  means the packet has been acked by the receiver.
    KERN_MARKED means the packet has been lost or ECN marked.

   user_flag stores per-packet flags from the app:
    USER_DELETED means the packet has been deleted by the app, and the kernel
      should not send it.

   data_p points to the real data, which is in the second part of the buffer.
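
To make the flag semantics concrete, here is a small sketch of how the two
sides might test them. The bit values, enum pkt_state, pkt_sendable() and
pkt_classify() are assumptions for illustration; so is the idea that the
kernel flags accumulate (GOING, then GONE, then ACKED or MARKED) rather than
replace each other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-packet flags; the bit assignments are assumptions. */
enum { KERN_GOING = 1 << 0, KERN_GONE = 1 << 1,
       KERN_ACKED = 1 << 2, KERN_MARKED = 1 << 3 };
enum { USER_DELETED = 1 << 0 };

struct packet {
    int   user_seq;
    int   kern_flag;
    int   user_flag;
    char *data_p;
};

enum pkt_state { PKT_QUEUED, PKT_DELETED, PKT_SENDING,
                 PKT_SENT, PKT_ACKED, PKT_MARKED };

/* Kernel side: send a packet only if the app has not deleted it and the
 * kernel has not already taken it. */
static bool pkt_sendable(const struct packet *p)
{
    assert(p != NULL);
    return !(p->user_flag & USER_DELETED) &&
           !(p->kern_flag & (KERN_GOING | KERN_GONE));
}

/* App side: summarize what happened to a packet, latest event first. */
static enum pkt_state pkt_classify(const struct packet *p)
{
    assert(p != NULL);
    if (p->kern_flag & KERN_MARKED)  return PKT_MARKED;
    if (p->kern_flag & KERN_ACKED)   return PKT_ACKED;
    if (p->kern_flag & KERN_GONE)    return PKT_SENT;
    if (p->kern_flag & KERN_GOING)   return PKT_SENDING;
    if (p->user_flag & USER_DELETED) return PKT_DELETED;
    return PKT_QUEUED;
}
```

Because the app can read kern_flag straight from the shared buffer, it can
learn about acks and losses without a system call.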

III. Operations of the app

   buffer_is_full(){
       /* the app can only see kern_p through kern_p_read */
       return user_p + 1 == kern_p_read;
   }

   app_add_packet(){
       while(buffer_is_full())
           poll()              /* blocking */
       ...fill in the new packet...
       inc(user_p2)
       user_p = user_p2
       if(kern_flag & KERN_NOTIFY)
           notify()
   }

   app_delete_packet(struct packet *pkt){
   	set_flag(pkt->user_flag, USER_DELETED)
   }

   app_move_pointer_backward(struct packet *dst_pkt){
       /* leave a few packets for the kernel */
       user_p = max(dst_pkt, kern_p_read + 3)
       ...change unsent packets starting at user_p...
       user_p = user_p2        /* original position */
       if(kern_flag & KERN_NOTIFY)
           notify()
   }

IV. Operations of the kernel

   buffer_is_empty(){
       if(kern_p < user_p <= user_p2)
           return FALSE
       else
           return TRUE
   }

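   kern_send_packet(){
       if(buffer_is_empty()){
           set_flag(kern_flag, KERN_NOTIFY)
           return              /* stop until the app calls notify() */
       }
       ...send the packet at kern_p...
       inc(kern_p)
       kern_p_read = kern_p
   }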
    
In all the descriptions above, inc(), max(), addition, and comparison are in
the cyclic space.
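
One way to read "in the cyclic space" is sketched below with indices modulo
the buffer size n. The helper names are hypothetical. Note that max() and
the range test only make sense relative to a base position, which this
sketch passes explicitly (the kernel's tail would be the natural choice).

```c
#include <assert.h>
#include <stdbool.h>

/* inc(): advance an index by one slot, wrapping at n. */
static unsigned cyc_inc(unsigned i, unsigned n)
{
    assert(i < n);
    return (i + 1) % n;
}

/* Addition: advance an index by k slots, wrapping at n. */
static unsigned cyc_add(unsigned i, unsigned k, unsigned n)
{
    assert(i < n);
    return (i + k % n) % n;
}

/* Number of slots from `from` forward to `to`. */
static unsigned cyc_dist(unsigned from, unsigned to, unsigned n)
{
    assert(from < n && to < n);
    return (to + n - from) % n;
}

/* max(): of a and b, the index that lies farther ahead of base. */
static unsigned cyc_max(unsigned base, unsigned a, unsigned b, unsigned n)
{
    return cyc_dist(base, a, n) >= cyc_dist(base, b, n) ? a : b;
}

/* Comparison: true if lo < x <= hi when walking forward from lo. */
static bool cyc_in_range(unsigned lo, unsigned x, unsigned hi, unsigned n)
{
    unsigned dx = cyc_dist(lo, x, n);
    return dx != 0 && dx <= cyc_dist(lo, hi, n);
}
```

With these helpers, the kernel's test kern_p < user_p <= user_p2 becomes
cyc_in_range(kern_p, user_p, user_p2, n), and the app's
max(dst_pkt, kern_p_read + 3) becomes
cyc_max(kern_p_read, dst_pkt, cyc_add(kern_p_read, 3, n), n).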

-Junwen, Eddie and Arun


_______________________________________________
dccp IETF mailing list: dccp@ietf.org
list info:  https://www1.ietf.org/mailman/listinfo/dccp
wg charter: http://www.ietf.org/html.charters/dccp-charter.html