
Remote Direct Memory Access August 26, 2007

Posted by linuxwarrior in Linux, Networking.

Written by: Vishal Thanki


In today’s world, where 1 Gbit/s Ethernet is very common, the demand for higher network speeds is rapidly increasing. Data centers, compute clusters and many R&D-oriented organizations need more speed out of Ethernet. The problem with conventional NICs is that they depend heavily on the TCP/IP stack above them. Even if the NIC is capable of high-speed data transfer, the TCP/IP overhead keeps the host CPU (the one on the motherboard) so busy that the machine cannot pump enough data to the NIC, so the NIC’s capacity for higher-speed transfer goes unused. RDMA technology is one solution to this problem.

Problems with conventional TCP/IP model

Let’s take the example of an application which transfers a file using the TCP/IP stack implementation in Linux. The application has two modes: 1. server and 2. client.
The server reads the file from disk; the client waits for the file to be received and writes it back to disk. Typically the flow is: the server reads from the file and copies the data into an application buffer, then hands the buffer to the stack; the stack passes it to the NIC; the NIC does the transfer; the buffer is received by the peer NIC, which hands it back to the TCP/IP stack; the stack returns the buffer to the client application, which finally writes it to the local file. The following steps come into the picture while doing a data transfer this way:

1. Read from the file and copy it into the application’s buffer (not a CPU-hungry operation).
2. Call the send() function of the socket API.

Linux’s TCP/IP stack lives inside the kernel. The send() function called by the user holds the buffer in user space. To hand this buffer to the TCP/IP stack (i.e. to the kernel), it must be copied from user space to kernel space. This copy is one of the most CPU-hungry operations and can hog the CPU up to 80–85%. Once the buffer is copied into the kernel buffer of the TCP/IP stack, the stack starts its own header processing (calculating checksums, fragmenting buffers, maintaining sequence numbers and so on — again somewhat CPU-hungry work). After all this processing, the buffer(s) are given to the NIC, and the NIC does the transfer.

3. Call the recv() function of the socket API on the peer.

The recv() operation is the reverse of what I described for step 2. Again a CPU-hungry copy takes place, this time from the kernel buffer back to user space, plus all the TCP/IP processing.

4. Write the buffer to the local file on disk.

Bottom line: even if the NIC is capable of transmitting data at a higher speed, the TCP/IP stack plus the host CPU restrict the NIC to a certain speed. This is where RDMA comes into the picture.
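The four steps above can be sketched with ordinary loopback sockets. This is a minimal illustration, not a benchmark; the port number and payload are invented for the example, and the hidden user-to-kernel copies happen inside sendall() and recv().

```python
import socket
import threading
import time

PORT = 50555  # arbitrary port chosen for this sketch

def serve_file(data: bytes) -> None:
    # "Server" side: step 2 — send() copies the buffer from user
    # space into the kernel's TCP/IP stack, which feeds the NIC.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            conn.sendall(data)

def fetch_file() -> bytes:
    # "Client" side: step 3 — recv() copies kernel buffers back to
    # user space. Retry until the server socket is listening.
    for _ in range(100):
        try:
            cli = socket.create_connection(("127.0.0.1", PORT))
            break
        except ConnectionRefusedError:
            time.sleep(0.05)
    chunks = []
    with cli:
        while chunk := cli.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)

payload = b"x" * 100_000  # stands in for the file read in step 1
t = threading.Thread(target=serve_file, args=(payload,))
t.start()
received = fetch_file()   # step 4 would write `received` to disk
t.join()
assert received == payload
```

Every byte here crosses the user/kernel boundary twice — exactly the copies RDMA is designed to eliminate.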

RDMA concepts

RDMA stands for Remote Direct Memory Access. A NIC built on this technology is called an RNIC. As the name suggests, an RNIC can access the user memory of its own machine as well as (virtually) that of the peer machine. RDMA also requires some changes in the underlying hardware, so the RNIC has to have additional functionality, such as a TOE (TCP Offload Engine — i.e. the whole TCP/IP stack implemented in a chip), a 10 Gig Ethernet output port, an on-board DMA controller and various other peripherals. (I don’t know much detail about the hardware peripherals.) The protocol on which RDMA technology works here is called iWARP (Internet Wide Area RDMA Protocol). The APIs which iWARP exposes are called iWARP “verbs”. These verbs cannot be mapped to socket APIs directly; they need special care with the buffers that the user allocates. So an application designed with iWARP verbs cannot communicate with an application that is not using iWARP. To make these verbs compatible with ordinary socket applications, the vendor needs to support upper-layer protocols (like SDP or WSD), which is beyond the scope of this article. RDMA has basically two main modes for data transfer: RDMA Read and RDMA Write.

RDMA Read/Write:
In RDMA Read, the machine issuing the operation reads memory on the peer machine. In RDMA Write, the machine issuing the operation writes its own buffer into the peer machine’s memory.
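The two modes can be illustrated with a toy model. This is NOT the iWARP verbs API — the class and method names are invented for this sketch — but it shows the key semantics: registered buffers are addressed by STag, and the peer’s CPU is not involved in the transfer.

```python
class ToyRNIC:
    """Toy stand-in for an RNIC; names invented for illustration."""

    def __init__(self):
        self._registered = {}  # STag -> registered bytearray
        self._next_stag = 1

    def register_buffer(self, buf: bytearray) -> int:
        # Hand back a unique STag for the registered buffer.
        stag = self._next_stag
        self._next_stag += 1
        self._registered[stag] = buf
        return stag

    def rdma_write(self, peer: "ToyRNIC", peer_stag: int, data: bytes) -> None:
        # Place our data directly into the peer's registered buffer.
        peer._registered[peer_stag][:len(data)] = data

    def rdma_read(self, peer: "ToyRNIC", peer_stag: int, length: int) -> bytes:
        # Pull data directly out of the peer's registered buffer.
        return bytes(peer._registered[peer_stag][:length])

server, client = ToyRNIC(), ToyRNIC()
client_buf = bytearray(16)
client_stag = client.register_buffer(client_buf)  # STag exchanged out of band

server.rdma_write(client, client_stag, b"hello rdma")
assert bytes(client_buf[:10]) == b"hello rdma"
assert server.rdma_read(client, client_stag, 10) == b"hello rdma"
```

On real hardware the "dictionary lookup" is done by the RNIC’s DMA engine against registered, pinned memory, with TCP/IP handled by the TOE underneath.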

How RNIC works

Let’s take the same example of an application doing a file transfer, but this time using an RNIC.

1. Allocate a user-level buffer and read the file into it.
2. Register this buffer with the RNIC using iWARP verbs (not a CPU-hungry operation).

Registering a buffer with the RNIC returns a unique identifier called an “STag” for that registered buffer. These STags (i.e. the IDs of registered buffers) have to be exchanged before doing any RDMA operation (i.e. RDMA Read/Write). This registration is essentially a mapping of the user buffer into the kernel: instead of copying it from user space, the buffer is mapped straight into kernel space. This concept is called ZCopy, or zero-copy.

3. Do an RDMA Write (for example) from the server side. (Since the server wants to write the file to the other machine, we can do an RDMA Write from the server; otherwise we would have to do an RDMA Read from the client side.) To do an RDMA Write, the machine needs the STag of the buffer on the peer machine where the data should be written. This initial exchange of STags must be handled by the application. The process looks very simple, but it is very complex in the internal layers. RDMA technology actually uses TCP/IP underneath, and this TCP/IP has to be in hardware (i.e. a TOE) to get the full advantage of the technology.

4. Once the RDMA Write is done, the buffer we wanted to transfer has been written into the peer machine’s memory.

RDMA Read is almost the reverse of RDMA Write. The host doing the RDMA Read has to know the STag on the peer machine (i.e. where to read from on the peer).
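The zero-copy idea behind step 2 can be demonstrated with plain mmap: instead of copying file data into a buffer, the pages are mapped and handed out as views. The file name here is invented for the example; real buffer registration additionally pins the pages for the RNIC’s DMA engine.

```python
import mmap
import os
import tempfile

# Create a throwaway file to stand in for the file being transferred.
path = os.path.join(tempfile.mkdtemp(), "payload.bin")
with open(path, "wb") as f:
    f.write(b"A" * 4096)

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapped)   # a view of the mapped pages, not a copy
    first = bytes(view[:16])    # a copy happens only when we ask for bytes
    view.release()              # release the view before closing the map
    mapped.close()

assert first == b"A" * 16
```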

Basically, RDMA Read/Write performs better when the data size is large. For small transfers, the overhead of exchanging STags and so on outweighs the benefit; for those, we can rely on the TOE alone.
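A back-of-the-envelope model shows why: a fixed per-operation setup cost (STag exchange and so on) is amortized over the payload. All the numbers below are invented for illustration, not measurements.

```python
def effective_throughput(payload_bytes: int,
                         wire_gbps: float = 10.0,
                         setup_us: float = 50.0) -> float:
    """Effective throughput in Gbit/s, including a fixed setup time."""
    wire_time_us = payload_bytes * 8 / (wire_gbps * 1000)  # time on the wire
    total_us = wire_time_us + setup_us
    return payload_bytes * 8 / (total_us * 1000)  # bits per ns == Gbit/s

small = effective_throughput(4 * 1024)          # 4 KiB message
large = effective_throughput(64 * 1024 * 1024)  # 64 MiB message

assert small < 1.0   # setup cost dominates: far below line rate
assert large > 9.9   # setup cost amortized: close to line rate
```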

This is how high performance with lower CPU utilization can be obtained. And it works over Ethernet, so there is no need to change the underlying infrastructure. The only problem with this technology is that both communicating parties have to be RDMA-enabled. If one party is not using RDMA, then both have to talk using the native TCP/IP stack.

That’s it! Please let me know in case of any queries.

Contributed by Vishal Thanki. Thanks Vishal!



1. linuxwarrior - August 26, 2007

This is a novel concept. And an enlightening post as well.


2. Vishal - August 26, 2007

Thanks. By the way, the concept is really cool, but the Linux community is not in favor of supporting TOE (TCP Offload Engine) within the Linux kernel. Here are the (valid, that’s what I believe) reasons..



3. Shouman - August 27, 2007

That’s a really cool post. Keep going, Vishal!

4. Ruhul Amin - August 27, 2007

Thanks for the wonderful post.

5. Abhijit - February 7, 2008

So it basically means that the problem of buffer management and TCP stack processing gets pushed down to hardware where TCP/IP is implemented.

Why is this a good idea? What problems does it really solve?

How is this different from adding another CPU to your machine that solely does TCP/IP stack processing?

6. Vishal - June 5, 2008

Well, the last question of your comment answers the first two questions 🙂 The basic problem was to free up the CPU as much as possible. That was fulfilled by pushing the monotonous header processing and generation job to hardware, which greatly reduces CPU utilization. So this WAS the __good__ idea to save the CPU.

But recent advancements in CPU architecture show that researchers are planning to put an extra core in the CPU chip just to handle network traffic.

So this is just shifting the work from the network card to a CPU core. The basic theme is the same: free up the main CPU by moving header processing and buffer copying to some other processor plus DMA, which can be on the NIC or in a CPU core itself.

7. Ugra Bhairav - July 23, 2008

I still do not understand the “kernel bypass” that is usually mentioned as one of the advantages of RDMA technology. (Not in this article though)

When a user-space process wants to do an RDMA Write, it will copy the data into the preregistered buffer. Now, what happens after that? How does the RNIC know that there is some data ready to be sent to the peer? I am confused because I don’t think a user-space process can directly talk to the hardware (is that right?). So the kernel has to be involved in some kind of signalling to tell the RNIC about the new work request.
