
Add a new protocol to Linux Kernel December 2, 2008

Posted by linuxwarrior in Linux.

Written By : Vishal Thanki

The Linux network subsystem supports several protocols and is flexible enough to allow new ones to be added. User applications reach these protocols through the socket interface, selecting one by its protocol family. The following sections cover the major steps for adding a new protocol family, using Linux kernel 2.6.24 as the reference. Implementing the new protocol itself is outside the scope of this document.
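To make the "by means of protocol family" part concrete, here is a small user-space sketch. The helper name try_open_socket() is made up for this illustration; the only real API it uses is the standard socket() call, whose first argument selects the protocol family.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if a stream socket of the given
 * family can be created, otherwise the errno reported by socket(). */
int try_open_socket(int family)
{
    int fd = socket(family, SOCK_STREAM, 0);

    if (fd < 0)
        return errno;   /* typically EAFNOSUPPORT for an unknown family */
    close(fd);
    return 0;
}
```

Calling try_open_socket(AF_INET) exercises the existing TCP/IP family. A newly added family would be passed in exactly the same way, and the call would be expected to fail (typically with EAFNOSUPPORT) until the kernel side has registered it.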

Top level view of Linux Network (Kernel) Sub-system:


For the scope of this document, we can consider the Linux network subsystem as consisting of three layers, as shown above.
1. The topmost “SOCKET” layer takes care of all socket-related system calls. It identifies the protocol family and forwards the call to the respective protocol implementation.

2. The next layer implements the transport and network layer protocols. This is where we can introduce our new protocol family.

3. The lowest layer is the network controller device driver, which provides hardware access.

Adding a New Protocol Family:
The Linux kernel network subsystem data structures “struct proto” (/include/net/sock.h) and “struct net_proto_family” (/include/linux/net.h) encapsulate the protocol family implementation.
The following step-by-step code snippets show a simplified example of registering a new protocol family similar to the TCP/IP stack (using IP as the network layer). Note that all new protocol-specific functions (to be implemented) have the prefix “my_”.
1) Initialize an instance of “struct proto” and register it with the Linux network subsystem by calling “proto_register()”.

/* Protocol specific socket structure */
struct my_sock {
    /* inet_sock must be the first member so that a struct sock pointer
     * can be cast to struct my_sock */
    struct inet_sock isk;
    /* Add the protocol implementation specific per-socket data members from here on */
};

struct proto my_proto = {
    /* Protocol specific handlers, all to be implemented */
    .close = my_close,
    .connect = my_connect,
    .disconnect = my_disconnect,
    .accept = my_accept,
    .ioctl = my_ioctl,
    .init = my_init_sock,
    .shutdown = my_shutdown,
    .setsockopt = my_setsockopt,
    .getsockopt = my_getsockopt,
    .sendmsg = my_sendmsg,
    .recvmsg = my_recvmsg,
    .unhash = my_unhash,
    .get_port = my_get_port,
    /* Memory accounting; this example reuses the TCP sysctl limits */
    .enter_memory_pressure = my_enter_memory_pressure,
    .sockets_allocated = &sockets_allocated,
    .memory_allocated = &memory_allocated,
    .memory_pressure = &memory_pressure,
    .orphan_count = &orphan_count,
    .sysctl_mem = sysctl_tcp_mem,
    .sysctl_wmem = sysctl_tcp_wmem,
    .sysctl_rmem = sysctl_tcp_rmem,
    .max_header = 0,
    /* Size of the per-socket object allocated for this protocol */
    .obj_size = sizeof(struct my_sock),
    .owner = THIS_MODULE,
    .name = "NEW_TCP",
};

/* In the module initialization routine.
 * Second argument 1: create a slab cache for this protocol's sockets */
rc = proto_register(&my_proto, 1);

2) Provide a routine that creates sockets for the new protocol, and register it with the socket layer by calling “sock_register()”. The “family” member specifies the address family of the new protocol.
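struct net_proto_family my_net_proto = {
    .family = AF_INET_NEW_TCP,
    .create = my_create_socket,
    .owner = THIS_MODULE,
};

/* In the module initialization routine, after proto_register() succeeds.
 * sock_register() takes only the net_proto_family pointer. */
rc = sock_register(&my_net_proto);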

3) The address family is the only interface through which user-level socket calls reach the new protocol implementation. Add the new address family, AF_INET_NEW_TCP, to /include/linux/socket.h. Any socket() call with this address family is then directed to the my_create_socket() function in the kernel, which establishes the use of the new protocol stack for all subsequent operations on that socket.

4) The protocol can be connection oriented or connectionless, as chosen by the protocol implementer. In the socket creation routine, the implementer specifies a “struct proto_ops” (/include/linux/net.h) instance. The socket layer calls the function members of this proto_ops instance before the protocol-specific functions (as defined in step 1) are called. A typical implementation of the create socket routine for a TCP/IP-like (connection oriented) new protocol:
static struct proto_ops my_proto_ops = {
    .family = PF_INET,
    .owner = THIS_MODULE,
    .release = inet_release,
    .bind = my_bind,
    .connect = inet_stream_connect,
    .socketpair = sock_no_socketpair,
    .accept = inet_accept,
    .getname = inet_getname,
    .poll = my_poll,
    .ioctl = inet_ioctl,
    .listen = my_inet_listen,
    .shutdown = inet_shutdown,
    .setsockopt = sock_common_setsockopt,
    .getsockopt = sock_common_getsockopt,
    .sendmsg = inet_sendmsg,
    .recvmsg = sock_common_recvmsg,
};

/* Note: the create() and sk_alloc() prototypes have changed between kernel
 * releases (later kernels pass a struct net pointer, for example); check
 * include/linux/net.h and include/net/sock.h in the target tree. */
static int my_create_socket(struct socket *sock, int protocol)
{
    struct sock *sk;

    /* Allocate the sock object; its size comes from my_proto.obj_size */
    sk = sk_alloc(AF_INET_NEW_TCP, GFP_KERNEL, &my_proto, 1);
    if (!sk) {
        printk(KERN_ERR "failed to allocate socket.\n");
        return -ENOMEM;
    }

    /* Tie the socket and sock objects together */
    sock_init_data(sock, sk);
    sk->sk_protocol = 0x0;

    /* All later socket calls on this socket go through my_proto_ops */
    sock->ops = &my_proto_ops;
    sock->state = SS_UNCONNECTED;

    /* Do the protocol specific socket object initialization */
    return 0;
}
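The way steps 2 to 4 fit together can be tried out in user space with a toy model. This is only a sketch of the dispatch idea and not kernel code: every toy_* name, the table size and the family number are made up for the illustration. Registering fills one slot in a per-family table, creating a socket looks up that slot and calls its create routine, and create installs the ops table that later calls go through.

```c
#include <errno.h>
#include <stddef.h>

/* Toy userspace model of the dispatch path; NOT kernel code.
 * Every toy_* name and value here is hypothetical. */
#define TOY_NPROTO 64          /* size of the family table (assumed) */
#define TOY_AF_NEW_TCP 40      /* placeholder number for the new family */

struct toy_socket;

struct toy_proto_ops {         /* plays the role of struct proto_ops */
    int (*connect)(struct toy_socket *sock);
};

struct toy_socket {
    const struct toy_proto_ops *ops;
    int connected;
};

struct toy_proto_family {      /* plays the role of struct net_proto_family */
    int family;
    int (*create)(struct toy_socket *sock, int protocol);
};

static const struct toy_proto_family *toy_families[TOY_NPROTO];

/* Like sock_register(): claim the table slot for one address family. */
int toy_sock_register(const struct toy_proto_family *fam)
{
    if (fam->family < 0 || fam->family >= TOY_NPROTO)
        return -ENOBUFS;
    if (toy_families[fam->family])
        return -EEXIST;
    toy_families[fam->family] = fam;
    return 0;
}

/* Like socket(): find the family, then let its create() set up the socket. */
int toy_socket_create(int family, int protocol, struct toy_socket *sock)
{
    if (family < 0 || family >= TOY_NPROTO || !toy_families[family])
        return -EAFNOSUPPORT;
    sock->ops = NULL;
    sock->connected = 0;
    return toy_families[family]->create(sock, protocol);
}

/* Like connect(): later calls go through the ops chosen at creation time. */
int toy_connect(struct toy_socket *sock)
{
    return sock->ops->connect(sock);
}

/* The "new protocol": its create() installs its own ops table. */
static int toy_my_connect(struct toy_socket *sock)
{
    sock->connected = 1;
    return 0;
}

static const struct toy_proto_ops toy_my_proto_ops = {
    .connect = toy_my_connect,
};

static int toy_my_create_socket(struct toy_socket *sock, int protocol)
{
    (void)protocol;            /* unused in this sketch */
    sock->ops = &toy_my_proto_ops;
    return 0;
}

const struct toy_proto_family toy_my_net_proto = {
    .family = TOY_AF_NEW_TCP,
    .create = toy_my_create_socket,
};
```

Before toy_sock_register(&toy_my_net_proto) is called, toy_socket_create(TOY_AF_NEW_TCP, ...) fails with -EAFNOSUPPORT. Afterwards it succeeds, and toy_connect() ends up in toy_my_connect(), which mirrors how the real socket layer ends up in the new protocol's handlers.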

Remote Direct Memory Access August 26, 2007

Posted by linuxwarrior in Linux, Networking.

Written By : Vishal Thanki


In today’s world, where 1 Gbit/s Ethernet is very common, the demand for higher network speeds is increasing rapidly. Data centers, cluster computers and many R&D-oriented organizations need more speed over Ethernet. The problem with conventional NICs is that they depend heavily on the TCP/IP stack above them. Even if the NIC is capable of high-speed data transfer, the TCP/IP overhead keeps the host CPU (the one on the motherboard) so busy that the machine cannot pump enough data to the NIC, so the NIC’s capacity to transfer at higher speed goes unused. RDMA technology is one possible solution to this problem.

Problems with conventional TCP/IP model

Let’s take the example of an application that transfers a file using the Linux TCP/IP stack. This application has two modes: 1. server 2. client.
The server reads the file from disk; the client waits for the file to arrive and writes it back to disk. Typically, the server reads from the file into an application buffer and passes that buffer to the stack, the stack passes it to the NIC, and the NIC transmits it. The peer NIC receives the data and hands it to its TCP/IP stack, the stack returns it to the client application, and the client writes it to a local file. A transfer done this way involves the following steps:

1. Read from the file and copy the data into the application’s buffer (not a CPU-hungry operation).
2. Call the send() function of the socket API.

Linux’s TCP/IP stack lives inside the kernel, while the buffer passed to send() is in user space. To give this buffer to the TCP/IP stack, it has to be copied from user space to kernel space. This copy is one of the most CPU-hungry operations and can hog up to 80-85% of the CPU. Once the buffer is copied into the stack’s kernel buffer, the stack does its own header processing (calculating checksums, fragmenting buffers, maintaining sequence numbers and so on), which is again somewhat CPU hungry. After all this processing, the buffer(s) are handed to the NIC, and the NIC performs the transfer.

3. Call the recv() function of the socket API on the peer.

The recv operation does the reverse of what was described for step 2. Again, CPU-hungry work takes place: copying the kernel buffer back to user space, plus all the TCP/IP processing.

4. Write the buffer to the local file on disk.
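The four steps can be walked through in one small user-space program. This is just a sketch: the function name transfer_file() is made up, and an AF_UNIX socket pair stands in for the TCP connection between two machines, so no TCP/IP header processing happens here. What it does show is each chunk of the file being copied into kernel socket buffers by send() and back out by recv().

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096

/* Hypothetical helper: copies src_path to dst_path through a socket pair.
 * Returns the number of bytes moved, or -1 on error. */
long transfer_file(const char *src_path, const char *dst_path)
{
    char tx_buf[CHUNK], rx_buf[CHUNK];
    int sv[2] = { -1, -1 };
    long total = 0, result = -1;
    ssize_t n = -1;

    int src = open(src_path, O_RDONLY);
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        goto out;

    /* Step 1: file -> application buffer. */
    while ((n = read(src, tx_buf, sizeof tx_buf)) > 0) {
        ssize_t sent = 0, got = 0, r;

        /* Step 2: send() copies the user buffer into kernel buffers. */
        while (sent < n) {
            r = send(sv[0], tx_buf + sent, (size_t)(n - sent), 0);
            if (r <= 0)
                goto out;
            sent += r;
        }
        /* Step 3: recv() copies kernel buffers back into user space. */
        while (got < n) {
            r = recv(sv[1], rx_buf + got, (size_t)(n - got), 0);
            if (r <= 0)
                goto out;
            got += r;
        }
        /* Step 4: application buffer -> file. */
        if (write(dst, rx_buf, (size_t)n) != n)
            goto out;
        total += n;
    }
    if (n == 0)
        result = total;   /* clean end of file */
out:
    if (src >= 0) close(src);
    if (dst >= 0) close(dst);
    if (sv[0] >= 0) close(sv[0]);
    if (sv[1] >= 0) close(sv[1]);
    return result;
}
```

Every byte of the file crosses the user/kernel boundary twice on top of the file reads and writes, which is the copying cost described in steps 2 and 3.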

Bottom line: even if the NIC is capable of transmitting data at a higher speed, the TCP/IP stack and the host CPU limit the NIC to a certain speed. This is where RDMA comes into the picture.
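To give a feel for the per-byte work hidden behind "calculating checksums", here is a plain C sketch of the 16-bit one's-complement checksum used by IP and TCP, written in the style of RFC 1071. The function name is made up. A real kernel uses optimized routines and can often leave this to the NIC, so take it only as an illustration that the CPU has to touch every byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 16-bit one's-complement Internet checksum
 * over len bytes, treating the data as big-endian 16-bit words. */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    /* Add up the data as 16-bit words. */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];

    /* An odd trailing byte is padded with a zero byte. */
    if (i < len)
        sum += (uint32_t)data[i] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

For the sample bytes 00 01 f2 03 f4 f5 f6 f7 from RFC 1071, the folded sum is 0xddf2, so the function returns its complement, 0x220d.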

RDMA concepts

RDMA stands for Remote Direct Memory Access. A NIC that implements this technology is called an RNIC. As the name suggests, an RNIC can access the user memory of its own machine as well as (virtually) that of the peer machine. RDMA technology also requires some changes in the underlying hardware, so the RNIC needs additional functionality such as a TOE (TCP offload engine, i.e. the whole TCP/IP stack implemented in a chip), a 10 Gigabit Ethernet port, an on-board DMA controller and many other peripherals. (I don’t know many details about the hardware peripherals.)

The protocol on which RDMA technology works is called iWarp (internet wide area rdma protocol). The APIs that iWarp supports are called iWarp “verbs”. These verbs cannot be mapped directly onto the socket API, because iWarp verbs need special care with the buffers the user allocates. So an application written with iWarp verbs cannot communicate with an application that does not use iWarp. To make the verbs compatible with other socket applications, the vendor needs to support Upper Layer Protocols (such as SDP or WSD), which is beyond the scope of this article.

RDMA has two main modes of data transfer: RDMA Read and RDMA Write.

RDMA Read/Write:
With RDMA Read, the machine issuing the operation reads the memory of the peer machine. With RDMA Write, the machine issuing the operation writes its own buffer into the peer machine’s memory.

How RNIC works

Let’s take the same example of an application doing a file transfer, but this time using an RNIC.

1. Allocate a user-level buffer to read the file into.
2. Register this buffer with the RNIC using iWarp verbs (not a CPU-hungry operation).

Registering a buffer with the RNIC returns a unique identifier, called an “STag”, for that buffer. These STags (i.e. the IDs of registered buffers) have to be exchanged before any RDMA operation (i.e. RDMA Read/Write). Registration is essentially a mapping between the user buffer and kernel space: instead of copying the data from user space, the buffer is mapped straight away. This concept is called ZCopy or Zero Copy.

3. Do an RDMA Write (just as an example) from the server side. Since the server wants to write the file to the other machine, it can do an RDMA Write; otherwise the client would have to do an RDMA Read. To do an RDMA Write, the machine must have the STag of the buffer on the peer machine where the data should be written. This initial exchange of STags has to be handled by the application. The process looks very simple, but it is very complex at the internal layers. RDMA technology uses TCP/IP underneath, and this TCP/IP has to be in hardware (i.e. a TOE) to get the benefit of the technology.

4. Once the RDMA Write is done, the buffer we wanted to transfer has been written to the peer machine.

RDMA Read is almost the reverse of RDMA Write. The host doing the RDMA Read has to know the STag of the buffer on the peer machine (i.e. where to read on the peer machine).
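The role of the STag can be shown with a toy simulation in plain C. This is not the iWarp verbs API: every toy_* name is invented, the "STag" is just a table index plus one, and memcpy() stands in for the DMA that a real RNIC would perform without the host CPU. The point is only that a registered buffer is named by a handle, and that Read and Write act on the peer's buffer through that handle.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy simulation; NOT the iWarp verbs API. All names are hypothetical. */
#define TOY_MAX_REGIONS 8

struct toy_region {
    void *base;
    size_t len;
    int in_use;
};

/* Stands in for the peer RNIC's table of registered buffers. */
struct toy_rnic {
    struct toy_region regions[TOY_MAX_REGIONS];
};

/* "Register" a buffer: returns an STag (index + 1), or 0 if the table is full. */
uint32_t toy_register_buffer(struct toy_rnic *nic, void *buf, size_t len)
{
    uint32_t i;

    for (i = 0; i < TOY_MAX_REGIONS; i++) {
        if (!nic->regions[i].in_use) {
            nic->regions[i].base = buf;
            nic->regions[i].len = len;
            nic->regions[i].in_use = 1;
            return i + 1;
        }
    }
    return 0;
}

/* Resolve an STag and check that [off, off + len) lies inside the buffer. */
static struct toy_region *toy_lookup(struct toy_rnic *nic, uint32_t stag,
                                     size_t off, size_t len)
{
    struct toy_region *r;

    if (stag == 0 || stag > TOY_MAX_REGIONS)
        return NULL;
    r = &nic->regions[stag - 1];
    if (!r->in_use || off > r->len || len > r->len - off)
        return NULL;
    return r;
}

/* RDMA Write: put local data into the peer buffer named by stag. */
int toy_rdma_write(struct toy_rnic *peer, uint32_t stag, size_t off,
                   const void *src, size_t len)
{
    struct toy_region *r = toy_lookup(peer, stag, off, len);

    if (!r)
        return -1;
    memcpy((char *)r->base + off, src, len);
    return 0;
}

/* RDMA Read: fetch data from the peer buffer named by stag. */
int toy_rdma_read(struct toy_rnic *peer, uint32_t stag, size_t off,
                  void *dst, size_t len)
{
    struct toy_region *r = toy_lookup(peer, stag, off, len);

    if (!r)
        return -1;
    memcpy(dst, (const char *)r->base + off, len);
    return 0;
}
```

In this model the client would call toy_register_buffer() on its receive buffer and send the returned STag to the server, and the server would then call toy_rdma_write() with that STag. An unknown STag or an out-of-range offset is rejected.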

Basically, RDMA Read/Write performs better when the data size is very large. For small data sizes, exchanging STags and so on adds some overhead, but for smaller transfers we can use the TOE.

This is how high performance with lower CPU usage can be obtained. And it works over Ethernet, so there is no need to change the underlying infrastructure. The only problem with this technology is that both communicating parties have to be RDMA enabled. If one of the parties is not using RDMA, both have to talk using the native TCP/IP stack.

That’s it! Please let me know if you have any questions.

Contributed by Vishal Thanki. Thanks Vishal!