[dccp] shared buffer DCCP API
Hi, All,
As Eddie mentioned, we have been designing a high-performance DCCP API,
and here it is :) The API is still tentative; any suggestions or comments
are very welcome.
------------------------------------------------------------------------------
The core of the API is a shared cyclic buffer between the kernel (as consumer)
and any user application (as producer). There are two pointers into this
buffer, u_p and k_p. u_p (basically a head pointer) is incremented each time
the app adds a packet to the buffer; k_p (basically a tail pointer) is
incremented each time the kernel takes one packet from the buffer and sends it.
If the kernel and the app are perfectly paced, there is no synchronization
problem, since u_p is updated only by the app and k_p is updated only by the
kernel. But in reality, the kernel can consume data faster than the app
produces it, or the app can produce data faster than the kernel consumes it.
In the former case, the kernel detects the situation by comparing u_p and k_p,
sets a flag (k_notify), and stops. The app, after adding a packet to the
buffer, makes a system call, notify(), to tell the kernel about the new data
if the flag is set. In the latter case, the app can do something else and
check back frequently, or it can call poll(), which blocks until the app can
write new data.
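The pacing rules above can be sketched in a few lines of C. This is only a
single-threaded illustration, not the proposed implementation: the names
struct ring, RING_SLOTS, kern_take(), app_put() and ring_notify() are
made up for the example, ints stand in for packets, and the memory
barriers that a real buffer shared with the kernel would need are left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8              /* assumed size, not from the post */

struct ring {
    size_t u_p;                   /* next slot the app fills (head) */
    size_t k_p;                   /* next slot the kernel sends (tail) */
    bool   k_notify;              /* set by the kernel when it runs dry */
    int    slot[RING_SLOTS];      /* ints stand in for packets */
};

static size_t ring_next(size_t i) { return (i + 1) % RING_SLOTS; }

/* Kernel side: take one packet, or set k_notify and stop if none is queued. */
static bool kern_take(struct ring *r, int *out)
{
    if (r->k_p == r->u_p) {       /* kernel has caught up with the app */
        r->k_notify = true;
        return false;
    }
    *out = r->slot[r->k_p];
    r->k_p = ring_next(r->k_p);
    return true;
}

/* Stand-in for the notify() system call: the kernel resumes sending. */
static void ring_notify(struct ring *r) { r->k_notify = false; }

/* App side: returns -1 if the buffer is full (the caller would poll()),
 * 1 if the packet was queued and notify() is needed, 0 otherwise. */
static int app_put(struct ring *r, int pkt)
{
    if (ring_next(r->u_p) == r->k_p)   /* one slot stays free: full != empty */
        return -1;
    r->slot[r->u_p] = pkt;
    r->u_p = ring_next(r->u_p);
    return r->k_notify ? 1 : 0;
}
```

Note that each index has exactly one writer (u_p the app, k_p the kernel),
which is why the common path needs no lock.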
When the connection experiences sudden network congestion and the congestion
window is decreased, the app has two choices. It can mark some packets already
added to the buffer as deleted. If that is not enough, it can go back and
change packets that are already in the buffer. Here is the detailed
procedure.
(1) Move u_p backward. A smart app should make sure that there are still a
few packets between k_p and u_p so that the kernel can still send data
without being blocked.
(2) After u_p is changed, the app checks k_p again (the kernel is always
running) and changes only those packets that the kernel hasn't sent yet.
Of course, we do not expect the kernel to catch up often, since we have
left some packets for the kernel in (1).
(3) After the changes, the app resets u_p to its original value from before
these changes and calls notify() if necessary.
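Steps (1) to (3) might look roughly like this in C. It is a sketch under
assumptions, not the actual design: slots are addressed by index, SLOTS,
GUARD, dist() and app_rewrite_tail() are hypothetical names, the "change"
is just overwriting an int, and a real version would need atomic reads of
k_p because the kernel runs concurrently.

```c
#include <stddef.h>

#define SLOTS 16   /* assumed buffer size */
#define GUARD 3    /* packets left untouched for the kernel (assumed) */

struct ring {
    size_t k_p;            /* advanced only by the kernel */
    size_t u_p;            /* advanced (and here rewound) only by the app */
    int    slot[SLOTS];    /* ints stand in for packets */
};

/* Forward distance from one slot to another in the cyclic space. */
static size_t dist(size_t from, size_t to) { return (to + SLOTS - from) % SLOTS; }

/* Overwrite queued packets from slot dst up to the current u_p with value.
 * Returns how many packets were changed. */
static size_t app_rewrite_tail(struct ring *r, size_t dst, int value)
{
    size_t orig   = r->u_p;
    size_t k0     = r->k_p;
    size_t queued = dist(k0, orig);
    size_t off    = dist(k0, dst);
    size_t start, i, changed = 0;

    if (off < GUARD)
        off = GUARD;                 /* keep a few packets for the kernel */
    if (off >= queued)
        return 0;                    /* nothing can be changed safely */

    start = (k0 + off) % SLOTS;
    r->u_p = start;                  /* (1) move u_p backward */

    for (i = start; i != orig; i = (i + 1) % SLOTS) {
        /* (2) re-read k_p and skip anything the kernel already took */
        if (dist(k0, r->k_p) > dist(k0, i))
            continue;
        r->slot[i] = value;
        changed++;
    }

    r->u_p = orig;                   /* (3) restore the original u_p */
    return changed;                  /* caller then calls notify() if the flag is set */
}
```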
The buffer can also hold sequence numbers and ack information so that the app
doesn't need to ask the kernel explicitly for the ack info.
This shared buffer API has several advantages.
(1) low context switch overhead
- A smart app is expected to generate data at roughly the same speed as the
kernel consumes it. If that is the case, no context switch is incurred
as the result of congestion window updates.
- Note that with the conventional socket API, an application can also
carefully pace its data generation based on some feedback from the kernel.
But it then needs to write() data in small chunks. In a busy server,
system calls can be expensive.
(2) late decision as to what to send
- if the app happens to generate too much data, it can cross out some
packets or even go back and change some packets.
(3) zero-copy
- Since the buffer is shared, by temporarily pinning the buffer, the
kernel doesn't need to make a copy of the data before passing it to
the device driver to send.
(4) high throughput
- With (1) and (3), this seems obvious.
If you want even more details, here they are :)
I. The shared buffer.
After calling socket(), the app registers a page-aligned buffer with the
kernel by calling setsockopt(). The kernel maps the buffer into kernel
address space and, in the simplest implementation, pins the whole buffer to
avoid context switch overhead when the buffer needs to be accessed in
interrupt or soft interrupt mode. A better kernel implementation would pin
only part of the buffer, and only when absolutely necessary.
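On the app side, getting a page-aligned region could look like the sketch
below. alloc_shared_region() is a hypothetical helper; it only uses the
standard sysconf() and C11 aligned_alloc() calls. The setsockopt() option
that would register the region is not named in this post, so the sketch
stops at allocation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Round bytes up to a whole number of pages and return a zeroed,
 * page-aligned region, or NULL on failure. The caller releases it
 * with free(). The rounded length is stored in *out_len. */
static void *alloc_shared_region(size_t bytes, size_t *out_len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t pg, len;
    void *buf;

    if (page <= 0 || bytes == 0)
        return NULL;
    pg  = (size_t)page;
    len = (bytes + pg - 1) / pg * pg;   /* aligned_alloc wants a multiple of pg */

    buf = aligned_alloc(pg, len);
    if (buf == NULL)
        return NULL;
    memset(buf, 0, len);                /* start with a clean header and flags */
    *out_len = len;
    return buf;
}
```

Whole pages matter here because the kernel maps and pins memory at page
granularity.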
II. Layout of the buffer
The buffer consists of two parts. The first part is described below by the
buffer_head structure; the second part is the real packet data.
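struct buffer_head {
    struct packet *kern_p_read;  /* read-only copy of the kernel's kern_p */
    int kern_flag;               /* kernel -> app flags, e.g. KERN_NOTIFY */
    struct packet *user_p;       /* pointer used by the app */
    struct packet *user_p2;      /* used mainly for moving user_p backward */
    int user_flag;               /* app -> kernel flags, e.g. USER_WITH_HEADER */
    int max_packets;             /* maximal number of packets in the buffer */
    struct packet pkt[1];        /* first of the packet slots */
};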
kern_p_read, read by the app, is a read-only version of kern_p, the real
private pointer used by the kernel. Each time kern_p is changed, the kernel
also updates kern_p_read, so kern_p_read reflects kern_p unless a malicious
app changes kern_p_read, which only hurts the app itself.
kern_flag stores various flags that the kernel wants to tell the app.
Currently only KERN_NOTIFY is defined. If KERN_NOTIFY is set, the app is
expected to call notify() after it adds a packet to the buffer.
user_p is the pointer used by the app.
user_flag stores various flags that the app wants to tell the kernel.
Currently USER_WITH_HEADER is defined. If USER_WITH_HEADER is set, the
app agrees to leave space for the basic IP+DCCP header in the packet space.
Of course, the flag is honored only when the app has euid root.
user_p2 is used mainly for moving the pointer backward.
max_packets is calculated by the app; it stores the maximal number of
packets allowed in the buffer.
Since this buffer_head is shared by the kernel and the app, the kernel
implementation has to be defensive so as to protect itself from malicious
apps.
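One way the app might compute max_packets and lay out the two parts is
sketched below. Several things here are assumptions and not part of the
proposal: a fixed payload slot of SLOT_BYTES per packet, the data part
starting right after the last descriptor, and the helper names
calc_max_packets() and layout_region().

```c
#include <stddef.h>

#define SLOT_BYTES 1500   /* assumed fixed payload size per packet */

struct packet {
    int   user_seq;
    int   kern_flag;
    int   user_flag;
    char *data_p;
};

struct buffer_head {
    struct packet *kern_p_read;
    int            kern_flag;
    struct packet *user_p;
    struct packet *user_p2;
    int            user_flag;
    int            max_packets;
    struct packet  pkt[1];
};

/* Largest packet count whose descriptors and payload slots both fit. */
static int calc_max_packets(size_t region_len)
{
    size_t fixed   = offsetof(struct buffer_head, pkt);
    size_t per_pkt = sizeof(struct packet) + SLOT_BYTES;

    if (region_len < fixed + per_pkt)
        return 0;
    return (int)((region_len - fixed) / per_pkt);
}

/* Fill in the header and point each descriptor at its payload slot. */
static void layout_region(void *region, size_t region_len)
{
    struct buffer_head *h  = region;
    struct packet      *pk = h->pkt;
    int   n    = calc_max_packets(region_len);
    char *data = (char *)region + offsetof(struct buffer_head, pkt)
                 + (size_t)n * sizeof(struct packet);
    int i;

    h->kern_flag   = 0;
    h->user_flag   = 0;
    h->max_packets = n;
    h->kern_p_read = h->user_p = h->user_p2 = pk;   /* empty buffer */
    for (i = 0; i < n; i++) {
        pk[i].user_seq  = 0;
        pk[i].kern_flag = 0;
        pk[i].user_flag = 0;
        pk[i].data_p    = data + (size_t)i * SLOT_BYTES;
    }
}
```

Because the app writes max_packets and every data_p, a defensive kernel
would have to re-check them against the registered region before use.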
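The structure of a packet can be described as
struct packet {
    int user_seq;    /* set by the app to identify the packet */
    int kern_flag;   /* per-packet flags from the kernel */
    int user_flag;   /* per-packet flags from the app */
    char *data_p;    /* points into the data part of the buffer */
};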
user_seq is used by the app to identify different packets.
kern_flag stores per-packet flags from the kernel.
KERN_GOING means the kernel has taken over the pkt and the pkt is about to
be sent by the device driver.
KERN_GONE means the packet has been sent by the device driver.
KERN_ACKED means the packet has been acked by the receiver.
KERN_MARKED means the packet has been lost or ECN marked.
user_flag stores per-packet flags from the app.
USER_DELETED means the packet has been deleted by the app, and the kernel
should not send it.
data_p points to the real data that is in the second part of the buffer.
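As a rough illustration of how the per-packet flags could be used, here is
a small C sketch. The bit values are assumptions (the post names the flags
but not their values), and the helpers kern_should_send(), app_may_modify()
and count_acked() are hypothetical. In particular, treating a packet as
editable until KERN_GOING or KERN_GONE is set is one possible reading of
the flags, not a stated rule.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed bit values; only the flag names come from the post. */
enum { KERN_GOING = 1 << 0, KERN_GONE = 1 << 1,
       KERN_ACKED = 1 << 2, KERN_MARKED = 1 << 3 };
enum { USER_DELETED = 1 << 0 };

struct packet {
    int   user_seq;
    int   kern_flag;
    int   user_flag;
    char *data_p;
};

/* Kernel side: skip packets the app has crossed out. */
static bool kern_should_send(const struct packet *p)
{
    return (p->user_flag & USER_DELETED) == 0;
}

/* App side (assumed rule): a packet is editable until the kernel takes it. */
static bool app_may_modify(const struct packet *p)
{
    return (p->kern_flag & (KERN_GOING | KERN_GONE)) == 0;
}

/* App side: count acked packets straight from the shared descriptors,
 * without asking the kernel. */
static size_t count_acked(const struct packet *pkts, size_t n)
{
    size_t acked = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (pkts[i].kern_flag & KERN_ACKED)
            acked++;
    return acked;
}
```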
III. Operations of the app
buffer_is_full(){
    /* the app can only see kern_p_read, the kernel's published copy of kern_p */
    return user_p + 1 == kern_p_read;
}
app_add_packet(){
    while (buffer_is_full())
        poll() /* blocking */
    ...fill in the packet at user_p2...
    inc(user_p2)
    user_p = user_p2
    if (kern_flag & KERN_NOTIFY)
        notify()
}
app_delete_packet(struct packet *pkt){
    set_flag(pkt->user_flag, USER_DELETED)
}
app_move_pointer_backward(struct packet *dst_pkt){
    user_p = max(dst_pkt, kern_p_read + 3) /* leave a few packets for the kernel */
    ...change packets starting at dst_pkt...
    user_p = user_p2 /* original position */
    if (kern_flag & KERN_NOTIFY)
        notify()
}
IV. Operations of the kernel
buffer_is_empty(){
    if (kern_p < user_p <= user_p2)
        return FALSE
    else
        return TRUE
}
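kern_send_packet(){
    if buffer_is_empty() {
        set_flag(kern_flag, KERN_NOTIFY)
        return /* stop until the app calls notify() */
    }
    ...send the packet at kern_p unless it is USER_DELETED...
    inc(kern_p)
    kern_p_read = kern_p /* publish the new position to the app */
}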
In all of the descriptions above, inc(), max(), addition and comparison are
in the cyclic space.
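To make "in the cyclic space" concrete, the operations could be written as
below. This is a sketch on slot indices, not pointers, and the cyc_* names
are invented for the example. The key assumption is that ordering is always
measured as forward distance from a base position (here the kernel's
pointer), since a plain "<" has no meaning once indices wrap.

```c
#include <stdbool.h>
#include <stddef.h>

/* Advance an index by one slot, wrapping at n. */
static size_t cyc_inc(size_t i, size_t n) { return (i + 1) % n; }

/* Forward distance from base to i in a ring of n slots. */
static size_t cyc_dist(size_t base, size_t i, size_t n)
{
    return (i + n - base) % n;
}

/* a < b, both measured forward from base. */
static bool cyc_less(size_t base, size_t a, size_t b, size_t n)
{
    return cyc_dist(base, a, n) < cyc_dist(base, b, n);
}

/* The one of a and b that lies further ahead of base. */
static size_t cyc_max(size_t base, size_t a, size_t b, size_t n)
{
    return cyc_less(base, a, b, n) ? b : a;
}

/* The kernel's test "kern_p < user_p <= user_p2", taken from kern_p. */
static bool buffer_is_empty(size_t kern_p, size_t user_p, size_t user_p2,
                            size_t n)
{
    size_t du  = cyc_dist(kern_p, user_p, n);
    size_t du2 = cyc_dist(kern_p, user_p2, n);

    return !(du > 0 && du <= du2);
}
```

With n = 8 and the kernel at slot 6, slot 1 counts as ahead of slot 7
because it is three steps forward rather than one. A user_p that lies
beyond user_p2 fails the test and is treated as empty, which is the
defensive choice for the kernel.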
-Junwen, Eddie and Arun
_______________________________________________
dccp IETF mailing list: dccp@ietf.org
list info: https://www1.ietf.org/mailman/listinfo/dccp
wg charter: http://www.ietf.org/html.charters/dccp-charter.html