
Add a new protocol to Linux Kernel December 2, 2008

Posted by linuxwarrior in Linux.

Written By : Vishal Thanki

The Linux network sub-system supports several protocols and is flexible enough to allow new ones to be added. User applications reach these protocols through the socket interface, selecting one by its protocol family. The following sections cover the major steps for adding a new protocol family (with Linux kernel 2.6.24 as the reference). Implementing the new protocol itself is outside the scope of this document.

Top level view of Linux Network (Kernel) Sub-system:


For the scope of this document, we can consider the Linux network sub-system as consisting of three layers, as shown above.
1. The topmost “SOCKET” layer takes care of all socket-related system calls. It identifies the protocol family and forwards the call to the respective protocol implementation.

2. The next layer implements the transport and network layer protocols; this is where we can introduce our new protocol family.

3. The lowest layer is the network controller device driver, which provides hardware access.
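
The forwarding done by the SOCKET layer can be pictured as a lookup in a table indexed by protocol family. The following is only a user-space toy model of that idea, not kernel code: every name in it (toy_family, toy_sock_register, toy_socket, the table size 8 and the family number 5) is a hypothetical stand-in chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_NPROTO 8    /* hypothetical table size, not the kernel's limit */

/* Hypothetical stand-in for a per-family descriptor. */
struct toy_family {
    int family;                     /* address family number */
    int (*create)(int protocol);    /* per-family socket constructor */
};

/* One slot per family number; NULL means "nothing registered". */
static const struct toy_family *toy_families[TOY_NPROTO];

/* Register a family; fails if the number is out of range or already taken. */
int toy_sock_register(const struct toy_family *fam)
{
    assert(fam != NULL && fam->create != NULL);
    if (fam->family < 0 || fam->family >= TOY_NPROTO)
        return -ENOBUFS;
    if (toy_families[fam->family] != NULL)
        return -EEXIST;
    toy_families[fam->family] = fam;
    return 0;
}

/* What the top layer does on socket(): find the family, forward the call. */
int toy_socket(int family, int protocol)
{
    if (family < 0 || family >= TOY_NPROTO || toy_families[family] == NULL)
        return -EAFNOSUPPORT;
    return toy_families[family]->create(protocol);
}

/* Example constructor for the made-up family number 5. */
static int my_toy_create(int protocol)
{
    return 100 + protocol;    /* pretend this is a socket handle */
}

const struct toy_family my_toy_family = {
    .family = 5,
    .create = my_toy_create,
};
```

Before my_toy_family is registered, toy_socket(5, 0) fails with -EAFNOSUPPORT; after toy_sock_register(&my_toy_family) succeeds, the same call is forwarded to my_toy_create(). The real registration steps follow the same pattern.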

Adding a New Protocol Family:
Two data structures of the Linux kernel network subsystem, “struct proto” (/include/net/sock.h) and “struct net_proto_family” (/include/linux/net.h), encapsulate the protocol family implementation.
The following step-by-step code snippets show a simplified example of registering a new protocol family similar to the TCP/IP stack (using IP as the network layer). Please note that all the new protocol-specific functions (to be implemented) have the prefix “my_”.
1) Initialize an instance of “struct proto” and register it with the Linux network sub-system by calling “proto_register()”.

/* Protocol specific socket structure */
struct my_sock {
    /* struct inet_sock must stay the first member */
    struct inet_sock isk;
    /* Add the protocol implementation specific per-socket data members from here on */
};

struct proto my_proto = {
    .close = my_close,
    .connect = my_connect,
    .disconnect = my_disconnect,
    .accept = my_accept,
    .ioctl = my_ioctl,
    .init = my_init_sock,
    .shutdown = my_shutdown,
    .setsockopt = my_setsockopt,
    .getsockopt = my_getsockopt,
    .sendmsg = my_sendmsg,
    .recvmsg = my_recvmsg,
    .unhash = my_unhash,
    .get_port = my_get_port,
    .enter_memory_pressure = my_enter_memory_pressure,
    .sockets_allocated = &sockets_allocated,
    .memory_allocated = &memory_allocated,
    .memory_pressure = &memory_pressure,
    .orphan_count = &orphan_count,
    .sysctl_mem = sysctl_tcp_mem,
    .sysctl_wmem = sysctl_tcp_wmem,
    .sysctl_rmem = sysctl_tcp_rmem,
    .max_header = 0,
    .obj_size = sizeof(struct my_sock),
    .owner = THIS_MODULE,
    .name = "NEW_TCP",
};

/* In the module init routine; the second argument (1) asks for a slab cache */
rc = proto_register(&my_proto, 1);

2) Provide a routine that creates sockets for the new protocol, and register it with the socket layer by calling “sock_register()”. The “family” member specifies the address family of the new protocol.
struct net_proto_family my_net_proto = {
    .family = AF_INET_NEW_TCP,
    .create = my_create_socket,
    .owner = THIS_MODULE,
};

/* sock_register() takes only the family descriptor */
rc = sock_register(&my_net_proto);

3) The new protocol’s address family is the only interface through which user-level socket calls reach the new protocol implementation. The new address family, AF_INET_NEW_TCP, should be added to /include/linux/socket.h. Any socket() call with this address family is directed to the my_create_socket() function in the kernel, which establishes the use of the new protocol stack for all subsequent socket operations.

4) The protocol can be connection-oriented or connectionless, as chosen by the protocol implementer. In the socket creation routine, the implementer specifies a “struct proto_ops” (/include/linux/net.h) instance. The socket layer calls the function members of this proto_ops instance before the protocol-specific functions (as defined in step 1) are called. A typical implementation of the socket creation routine for a TCP/IP-like (connection-oriented) new protocol:
static struct proto_ops my_proto_ops = {
    .family = PF_INET,
    .owner = THIS_MODULE,
    .release = inet_release,
    .bind = my_bind,
    .connect = inet_stream_connect,
    .socketpair = sock_no_socketpair,
    .accept = inet_accept,
    .getname = inet_getname,
    .poll = my_poll,
    .ioctl = inet_ioctl,
    .listen = my_inet_listen,
    .shutdown = inet_shutdown,
    .setsockopt = sock_common_setsockopt,
    .getsockopt = sock_common_getsockopt,
    .sendmsg = inet_sendmsg,
    .recvmsg = sock_common_recvmsg,
};

/* In 2.6.24 the create routine and sk_alloc() also receive the network namespace */
static int my_create_socket(struct net *net, struct socket *sock, int protocol)
{
    struct sock *sk;

    sk = sk_alloc(net, PF_INET_NEW_TCP, GFP_KERNEL, &my_proto, 1);
    if (!sk) {
        printk(KERN_ERR "failed to allocate socket.\n");
        return -ENOMEM;
    }

    sock_init_data(sock, sk);
    sk->sk_protocol = 0x0;

    sock->ops = &my_proto_ops;
    sock->state = SS_UNCONNECTED;

    /* Do the protocol specific socket object initialization */
    return 0;
}
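
From user space, step 3 amounts to an ordinary socket() call with the new family number. The sketch below is a user-space illustration and not part of the original example: the value 250 given to AF_INET_NEW_TCP is a placeholder assumption (the real value is whatever gets added to socket.h), and open_proto_socket() is a hypothetical helper name. On a kernel where the family is not registered, the call is expected to fail, typically with EAFNOSUPPORT.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Placeholder value (assumption); use the number actually added to socket.h. */
#ifndef AF_INET_NEW_TCP
#define AF_INET_NEW_TCP 250
#endif

/*
 * Hypothetical helper: open a stream socket of the given address family.
 * Returns the file descriptor on success or a negative errno value.
 */
int open_proto_socket(int family)
{
    int fd;

    assert(family > 0);    /* AF_UNSPEC (0) is not a usable family here */

    /* With the new family registered, this call ends up in my_create_socket(). */
    fd = socket(family, SOCK_STREAM, 0);
    if (fd < 0)
        return -errno;
    return fd;
}
```

An application would call open_proto_socket(AF_INET_NEW_TCP) and then use the returned descriptor with the usual bind(), connect(), send() and recv() calls.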


Facebook blocking series June 5, 2008

Posted by linuxwarrior in Blogroll.

We have all heard about the recent blocking by Facebook. It all started when Facebook blocked Google’s Friend Connect, and the series continues.

If you try to buy an ad on Facebook, there are certain words that are taboo. Any ads that contain four-letter words are automatically blocked. So too are ads with the names of competing social networks: “MySpace,” “Friendster,” “Hi5,” or “Orkut.” (Curiously, “Bebo” and “OpenSocial” go through just fine, as do “Microsoft,” “Yahoo,” “Google,” and “AOL”).

Okay, so Facebook doesn’t want to run ads for some of its competitors. But why is 3Jam blocked? The startup offers an SMS service that lets people send multiple text messages at once, and it even has a Facebook app that does the same thing.

CEO Andy Jagoe was befuddled when he tried to create a Facebook ad to test a new product, only to find out that the term “3Jam” was also blocked. (The product actually sounds pretty cool: it will be a way to send and receive text messages for free while you are online, and then route them to your phone when you are offline). Says Jagoe:

It seems crazy to think that they consider us competitive. This is kind of weird. It is like censorship.

It does seem weird. What other startup names or products are blocked by Facebook?

An Open Source Project for Windows Platform – NSIS May 26, 2008

Posted by linuxwarrior in LW Talk, Open Source.

I am not writing this for Windows people. As an open source enthusiast, you will be glad to know that this project, which builds so many installers for the Windows platform, is fully open source. Mr. Bill should thank this project. Although there are now automated (script-free) systems for building installers on Windows, this one is a fully code-based installer system that gives you full control and freedom. Yes, NSIS (Nullsoft Scriptable Install System) is a professional open source system for building Windows installers. Link : http://nsis.sourceforge.net

My Experience
I did not know this system before. I worked with it for the first time after starting my professional career, when I had to learn it for a project. The code looks like assembly, with many macros, plugins and function-based activity, which is great to learn. You can create new plugins, which is a welcome feature. Previously the plugin directory and architecture were not so good, but they have improved a lot, and there are now many plugins in the directory. Here is the link: http://nsis.sourceforge.net/Category:Plugins ; it is under their Development Centre. The GUI part is now properly developed.
The community behind this project is very active and powerful. If you visit the official forum, there are always some burning issues under discussion. The site has code examples and information about some real-world installers. The official documentation is also good for newcomers. NSIS covers most things, from MySQL to Game Explorer, string manipulation, sorting components and so on; all of these are exciting parts.

An NSIS installer is not a simple thing; it will burn your head if you work on it deeply. In my project the installation involved Java-related issues, Apache Tomcat and the famous JAR files, which burned me a lot. Thank God the work got done. If you are interested in working with this open source project, please visit the NSIS site for detailed information.
Have a nice day.

Contributed by Shouman Das

Keep it up Amit May 26, 2008

Posted by linuxwarrior in LW Talk.

Dear Amit,
First of all, I am sorry that I can’t keep my promise; I have to share this with everyone. Dear members, our friend Amit has joined Sun Microsystems, and it is my pleasure to share this with you. Congrats, Amit, on this great achievement; you deserve it. We all wish you a successful career there. No more for today. Good work…


Remote Direct Memory Access August 26, 2007

Posted by linuxwarrior in Linux, Networking.

Written By : Vishal Thanki


In today’s world, where 1 Gbit/s Ethernet is very common, the demand for high-speed networking is increasing rapidly. Data centers, cluster computers and many R&D-oriented organizations need more speed over Ethernet. The problem with conventional NICs is that they depend heavily on the TCP/IP stack above them. Even if the NIC is capable of high-speed data transfer, the TCP/IP overhead keeps the host CPU (the one on the motherboard) so busy that the machine cannot pump data to the NIC fast enough, and so it never uses the NIC’s full transfer capability. RDMA technology is one of the solutions to this problem.

Problems with conventional TCP/IP model

Let’s take the example of an application that does file transfer (using the Linux TCP/IP stack implementation). This application has two modes: 1. server 2. client.
The server reads the file from disk; the client waits for the file to be received and writes it back to disk. What usually happens is this: the server reads from the file and copies the data into an application buffer, then sends it to the stack; the stack passes the buffer to the NIC; the NIC does the transfer; the buffer is received by the peer NIC, which hands it to the TCP/IP stack; the stack returns the buffer to the client application; and finally the buffer is written to the local file. The following steps come into the picture when data is transferred this way:

1. Read from the file and copy it into the application’s buffer (not a CPU-hungry operation).
2. Call the send() function of the socket API.

The Linux TCP/IP stack lives inside the kernel. The buffer passed to send() by the user is in user space. To give this buffer to the TCP/IP stack (i.e. to the kernel), it has to be copied from user space to kernel space. This copy is one of the most CPU-hungry operations and can hog up to 80-85% of the CPU. Once the buffer has been copied into a kernel buffer of the TCP/IP stack, the stack starts its own header processing (calculating checksums, fragmenting buffers, maintaining sequence numbers and so on, which are again somewhat CPU-hungry operations). After all this processing, the buffer(s) are given to the NIC, and the NIC performs the transfer.

3. Call the recv() function of the socket API on the peer machine.

The recv operation does the reverse of what was described for step 2. Again a CPU-hungry operation takes place in copying the kernel buffer back to user space, plus all the TCP/IP processing.

4. The buffer is written to the local file on disk.

Bottom line: even if the NIC is capable of transmitting data at a higher speed, the TCP/IP stack plus the host CPU limit the NIC to a certain speed. This is where RDMA comes into the picture.
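
The server side of the conventional path described above can be sketched in user-space C as follows. This is a minimal illustration under assumptions, not code from any particular application: send_file_copying() is a hypothetical helper name and the 4096-byte buffer size is arbitrary. The comments mark where the two copies happen.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send everything readable from file_fd over sock_fd.
 * Returns the number of bytes sent, or -1 on error (errno is set).
 */
ssize_t send_file_copying(int file_fd, int sock_fd)
{
    char buf[4096];    /* the application buffer in user space */
    ssize_t total = 0;

    assert(file_fd >= 0 && sock_fd >= 0);

    for (;;) {
        /* Copy 1: file data -> application buffer. */
        ssize_t n = read(file_fd, buf, sizeof(buf));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;    /* end of file */

        /* Copy 2: application buffer -> kernel buffer of the TCP/IP stack. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t sent = send(sock_fd, buf + off, (size_t)(n - off), 0);
            if (sent < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += sent;
        }
        total += n;
    }
    return total;
}
```

Every byte of the file crosses the user/kernel boundary through send(), and the client repeats the same work in reverse with recv(). This per-byte copying is the cost that RDMA tries to avoid.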

RDMA concepts

RDMA stands for Remote Direct Memory Access. A NIC that works on this technology is called an RNIC. As the name suggests, an RNIC has the capability to access the user memory of its own machine as well as (virtually) that of the peer machine. RDMA technology also requires some changes in the underlying hardware, so the RNIC has to have some additional functionality, such as a TOE (TCP offload engine, i.e. the whole TCP/IP stack implemented in a chip), a 10 Gigabit Ethernet standard output port, an on-board DMA controller and many other peripherals. (I don’t know many details about the hardware peripherals.)

The protocol on which RDMA technology works is called iWarp (internet wide area RDMA protocol). The APIs that iWarp supports are called iWarp “verbs”. These verbs cannot all be mapped directly onto the socket APIs, because iWarp verbs need to take special care of the buffers that the user allocates. So an application designed using iWarp verbs cannot communicate with an application that is not using iWarp. To make these verbs compatible with other socket applications, the vendor needs to provide support for Upper Layer Protocols (like SDP, WSD), which is beyond the scope of this article. RDMA has two main modes of data transfer: RDMA Read and RDMA Write.

RDMA Read/Write :
In an RDMA Read, the machine issuing the operation reads the memory of the peer machine. In an RDMA Write, the machine issuing the operation writes its own buffer into the peer machine’s memory.

How RNIC works

Let’s take the same example of an application doing file transfer, but this time using an RNIC.

1. Allocate a user-level buffer to read from a file.
2. Register this buffer with the RNIC using iWarp verbs (not a CPU-hungry operation).

Registering a buffer with the RNIC returns a unique identifier, called an “STag”, for that registered buffer. These STags (i.e. the IDs of registered buffers) have to be exchanged for any RDMA operation (i.e. RDMA Read/Write). This registration is really nothing but a mapping between the kernel buffer and the user buffer: instead of copying the data from user space, we map the buffer straight into kernel space. This concept is called ZCopy, or Zero Copy.

3. Do an RDMA Write (just as an example) from the server side. (As the server wants to write the file to the other machine, we can do an RDMA Write from the server; otherwise we would have to do an RDMA Read from the client side.) To do an RDMA Write, the machine must have the STags of the peer machine where the data is to be written. This initial exchange of STags has to be handled by the application. The process looks very simple, but it is very complex at the internal layers. RDMA technology actually uses TCP/IP underneath, and this TCP/IP has to be in hardware (i.e. TOE) to get the advantage of the technology.

4. Once the RDMA Write is done, the buffer we wanted to transfer has been written to the peer machine.

RDMA Read is almost the reverse of RDMA Write. The host doing the RDMA Read has to know the STag of the peer machine (i.e. where to read on the peer machine).
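
The role of the STag can be illustrated with a small user-space toy model. This is a sketch under stated assumptions, not the iWarp verbs API and not how an RNIC is implemented: toy_rnic, toy_register(), toy_rdma_write() and toy_rdma_read() are hypothetical names, and the “peer machine” is simply another structure in the same process. It shows only the bookkeeping: a registered buffer gets an STag, and a peer that knows the STag can write into or read from that buffer directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_REGIONS 16    /* hypothetical limit for this sketch */

/* One registered buffer: where it lives and how big it is. */
struct toy_region {
    unsigned char *addr;
    size_t len;
};

/* Toy stand-in for an RNIC's table of registered buffers. */
struct toy_rnic {
    struct toy_region regions[TOY_MAX_REGIONS];
};

/* Register a buffer; returns its STag (non-zero) or 0 if the table is full. */
uint32_t toy_register(struct toy_rnic *nic, void *addr, size_t len)
{
    assert(nic != NULL && addr != NULL);
    for (uint32_t i = 0; i < TOY_MAX_REGIONS; i++) {
        if (nic->regions[i].addr == NULL) {
            nic->regions[i].addr = addr;
            nic->regions[i].len = len;
            return i + 1;    /* STag 0 is kept as "invalid" */
        }
    }
    return 0;
}

/* Find a region by STag and check that [offset, offset + len) fits in it. */
static struct toy_region *toy_lookup(struct toy_rnic *nic, uint32_t stag,
                                     size_t offset, size_t len)
{
    struct toy_region *r;

    if (stag == 0 || stag > TOY_MAX_REGIONS)
        return NULL;
    r = &nic->regions[stag - 1];
    if (r->addr == NULL || offset > r->len || len > r->len - offset)
        return NULL;
    return r;
}

/* RDMA Write: the initiator pushes its own data into the peer's buffer. */
int toy_rdma_write(struct toy_rnic *peer, uint32_t stag, size_t offset,
                   const void *src, size_t len)
{
    struct toy_region *r = toy_lookup(peer, stag, offset, len);

    if (r == NULL)
        return -1;
    memcpy(r->addr + offset, src, len);
    return 0;
}

/* RDMA Read: the initiator pulls data out of the peer's buffer. */
int toy_rdma_read(struct toy_rnic *peer, uint32_t stag, size_t offset,
                  void *dst, size_t len)
{
    struct toy_region *r = toy_lookup(peer, stag, offset, len);

    if (r == NULL)
        return -1;
    memcpy(dst, r->addr + offset, len);
    return 0;
}
```

In this model the client would register its receive buffer, hand the resulting STag to the server over an ordinary connection, and the server would then call toy_rdma_write() with that STag. An unknown STag or an out-of-range offset is rejected, which mirrors why the STag exchange has to happen before any RDMA operation.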

Basically, RDMA Read/Write performs better when the data size is very large. For small data sizes, exchanging STags and so on adds some overhead, so for small transfers we can use TOE alone.

This is how high performance with lower CPU usage can be obtained. It also works over Ethernet, so there is no need to change the underlying infrastructure. The only problem with this technology is that both communicating parties have to be RDMA-enabled. If one of the parties is not using RDMA, both have to talk using the native TCP/IP stack.

That’s it! Please let me know if you have any queries.

Contributed by Vishal Thanki. Thanks Vishal!

Reunion May 26, 2007

Posted by linuxwarrior in LW Talk.

Linuxwarrior is going to arrange another virtual meeting soon. A number of our active members are very keen to meet again, which is why we decided to arrange another one. We hope this will bring us closer together. Linuxwarrior believes in knowledge sharing, and meetings or get-togethers are the main way to do that. We hope our members will participate in full force.

We will announce the date and time soon. Have a nice day.

New Chief Coordinator — Linuxwarrior April 9, 2007

Posted by linuxwarrior in LW Talk.

Linuxwarrior has a new Chief Coordinator. Prashanth G joined on 4th April as the Chief Coordinator of the Linuxwarrior Group. Members and everyone else, please visit this link for detailed information: http://linuxwarrior.info/

We wish him every success as the Chief. Happy linuxing…..

On behalf of Technical Team

Mailing list guidelines April 2, 2007

Posted by linuxwarrior in Blogroll.

This is a monthly reminder of the mail posting guidelines.

1. Use a proper subject line.

2. Do not troll in the mailing list

Link: http://en.wikipedia.org/wiki/Internet_troll

3. Use [OT] for off-topic, non-technical discussions.
But, don’t misuse this to start flame wars or to troll
in the mailing list.

4. Do not top-post:

Link: http://en.wikipedia.org/wiki/Top-posting

Example of a top-post:

Because it messes up the order in which
people normally read text.
> > Why is top-posting
> > such a bad thing?
> > What is the most
> > annoying thing in e-mail?

Always use interleaved posting. Trim the post you are
replying to and interleave your replies so that each
part of the reply is under the quoted part that you
are replying to. Further, please leave a gap of one
line between the matter you are quoting and your
reply. For example:

what is your name?

trimmed and replied as:

>what is your name?

my name is ilugc

and not as:

>what is your name?
my name is linuxwarrior.

5. Do not over-quote:

Example of an over-quote:

>>> On Friday you said
>>>  blah blah blah
>> On Saturday you said
>>  foo foo foo
> On Sunday you said
>  foobar

6. Do not post HTML messages

7. Do not recycle messages

8. Do not send irrelevant attachments

9. Do not attach obnoxious, nonsensical legal
disclaimers. If your company uses disclaimers.

10. Search for answers for your questions/problems in
google.com or any other search engine before posting
your query to the mailing list.

If any of the above are not clear to you, Please read
the detailed list guidelines:


12. Do not post messages in all capital letters. Mail
in CAPS is considered rude and is similar to shouting
during a conversation.

13. Following these guidelines could be helpful

14. Don’t send season’s greetings or birthday invites
to the group. It’s not mailing list etiquette.

We must use the above guidelines to maintain better
readability, proper archives, efficient bandwidth
usage, etc.

Union of the Warriors, 12th February February 13, 2007

Posted by linuxwarrior in Blogroll.

We finally met yesterday, over Skype and IRC. For security reasons some members objected to IRC, so in the end we moved to Skype, where we met several members for the first time. Although it was an informal Linuxwarrior meeting, Python soon took over the main part of the conversation. Some simple issues were discussed, and for a moment we got a flavor of web applications. The meeting lasted 3 hours.

We thank Prashanth, Santosh, Abhijit and the others for their active participation. This is just the beginning; later on we will discuss Linux issues on Skype. So, guys, from now on please activate your Skype accounts. We will schedule our next virtual meeting soon and need your feedback about it.

So bye for now. Happy Coding.


Month Review – January 2007 January 31, 2007

Posted by linuxwarrior in Blogroll.

Things are going well. Linuxwarrior members are now actively working in the group. This month members published some of their problems and solutions on Linux. At the moment the group is focused on Linux, but we strongly request everyone to also write about topics beyond Linux: open source in general and other areas. We know we have Microsoft and Oracle gurus in the group. In the coming month we will focus on all these topics. A different flavor, so get ready for it.

The Linuxwarrior blog took a big step this month: after charge passed from Arnold to Prashanth, the blog got a boost. We thank both of them.

We also have sad news this month: on the last day of January, KV Prashant stepped down as moderator. KV, thanks for the nice job. We believe you will soon be back in your old post.

We don’t think we will achieve everything in just one day; that would be miserable. We reach our goals through activity, resources and capability, day by day, month by month. LW can do it. Believe in your inner mind.

Congrats to all the warriors and well-wishers. We wish you joy in the days to come.

Good luck.

Technical Team