Thursday, February 16, 2012

The Power of Parallelism - Part III

In the last post we talked about multiprocessing, where a parent process forked children to work in parallel. However, creating multiple processes to share the load of a common task is not always worth the effort, since switching between processes is an expensive operation for the operating system. You may sometime dive deep into the kernel source to see everything a scheduler needs to do while switching between processes.

In order to make parallel computing more efficient, lightweight processes called threads can be created within the scope of a process. Since threads are multiple workers running in parallel within a process, they share the logical address space, heap, memory maps, even the open files of the process. Thus, to switch between threads, only the stacks and the respective instruction pointers need to be swapped, which is much less overhead than switching between separate processes.

However, it should be kept in mind that threads run in parallel and independently of each other, working over shared global data, and hence synchronizing their operations is equally important. Just as multiple processes use semaphores, threads can, too! And what's more fun is that the semaphores don't need to be in shared memory, as all threads can access the global semaphores.

To work an example out, let's solve the problem stated in the last post using threads. To create a thread, we just need to call

pthread_create(&newThread, &threadAttribute, operatingFunction, argumentToFunction);

Here, threadAttribute is a pthread_attr_t variable which decides the behavior of the newly created thread. Now, how do we tell the thread what to do after it is born? Well, not much; it just has to execute the code in the function named operatingFunction, which takes a void * argument and returns a void * value. Because the parameter and the return value are both void pointers, any type of data can be sent to and received from the thread. Here argumentToFunction is the argument we send to the thread. Once the thread is created, its ID is sent back in newThread so that it can be controlled and monitored. More reference material on thread-related functions can be found here.

The worked out solution to the problem discussed in the last post can be found here. One thing to note is that between successive attempts to lock the semaphores, the thread relinquishes the CPU by calling sched_yield. This stops it from continuously hogging the CPU while polling the lock, and thus gives other threads a chance to finish their work and eventually release the locks. I would suggest you comment out the two lines containing sched_yield, compile and execute to see the drastic fall in performance.


I hope it's now clear how easy it is to use threads to improve the response time of programs. Here you can find a worked out (and well commented, as well) multi-threaded solution to counting occurrences of a specific byte in a file. The screenshot below might interest you to have a look into it ...


N.B. To all who had a gaze into the above program: it's pretty awkward to reserve memory to store the entire file. It's equally awkward to read such a huge chunk of data in a single call to read. You might find it interesting to make the program a bit more savvy ... remember the synchronization issues, of course!

Well, that's all for this series of posts related to parallel processing. I hope they prove to be helpful. Keep reading for more ...

Wednesday, February 15, 2012

The Power of Parallelism - Part II

So, we have all dealt with processes, at some time or the other. Even the browser we are reading this in is a process in itself. If we have a look at the output of the ps command, we can find multiple processes running in a system at any given time. We have learnt to write programs which compile into an executable and then become a process when launched. In the last post, we also talked about launching an executable file to create a new process.

In this post, we will see how we can create processes on the fly. In fact, it's quite an easy thing to do. Anywhere in your program, write this:

int pid;

pid=fork();
if(pid==-1){
        /*The system failed to spawn a new process*/
}else if(pid==0){
        /*This part is executed by the child*/
}else{
        /*This is parent's job to do*/
}

fork is a system call; when called from a process, it dives deep into the kernel to create a copy of the calling process and returns to both of them. If process creation fails, it can return only to the parent, and hence returns -1. If, however, the child is born successfully, there are now two processes to return to! So it returns 0 to the child, and the child's process ID to the parent.

Now, being independent of each other, the two processes can finish their jobs at any time and exit. But if the parent process is really concerned about the child's exit status, and must wait to clean up anything its children might have messed up, it can do so with a call to wait. If any child forked by the caller is running, the system call wait makes the caller sleep until one exits. If no such child exists, the call returns -1 immediately.

A much awaited query follows the discussion inevitably ... now that we have two processes, can they talk to each other? Well, of course they can! Inter-process communication (IPC) saves the day. Two processes can share data using sockets, message queues or shared memory, and in this post we will deal with IPC using shared memory.

A process can create a shared memory segment using

int shmid = shmget(key_t key, size_t size, int shmflg);

which returns the ID of the newly created shared memory segment, or -1 if it fails. Such a shared memory segment can later be destroyed by

shmctl(shmid, IPC_RMID, NULL);

Once a shared memory segment is created, the processes need to attach to it to be able to use it. A call to shmat does this. Similarly, shmdt detaches the virtual memory area corresponding to the shared memory segment from the memory map of the calling process.

Having said all this, it's time to bring it all into action. We can think of a problem where there is a shared variable, and each process monitors it. They check the value and go to sleep. If they find the value unchanged after waking up, they stay happy. Otherwise, hard luck! Writing a program to do that should not be that tough, right? So here it goes ....


As is clearly seen, all the children can read and write the shared memory just as expected, and that leads to yet another problem ... synchronizing their activities. This is required as all of them are now working on common data, and it's quite possible for one to overwrite data that is being used by another.

So, how do we synchronize access to the variables? Nothing much: we add two semaphores to protect the two variables. Semaphores can be thought of as counters, which count down when locked (or waited upon); once the value reaches zero, the caller is blocked until someone else unlocks (or posts) to increment the value.

We therefore tune the code so that each process locks both semaphores first (with a small rest in between the two locks, the reason for which will become clear shortly) and then goes to sleep. Upon rechecking the value after waking up, the semaphores are unlocked, so that others can try their feat, too. However, which semaphore to lock first is decided at random. The modified code here was tried out a few times


What happened now? Both the children were happy in the first case, while in the second case, when 10 children were created, they just hung! We had to kill the system!

This is because, owing to the randomness in the locking order and the sleep in between, a situation might have occurred where one child acquired the prot_one semaphore and requested prot_two, while another held prot_two while asking for prot_one! A deadlock!

Well, the writers of the real-time libraries must have thought of this in advance, and so we do have a workaround ... sem_trywait! It is not as aggressive as sem_wait and hence does not block if the semaphore is already locked by someone else. Using this trick, we rewrite the code once again


Voila! It works perfectly now! All the children are finally happy ...

So, in this part, we have discussed creating children, sharing data and synchronization ... the keys required for parallel processing. But creating processes to carry out multiple tasks is an inefficient way of doing things, because switching between processes takes up a lot of processing power. As we will see in the next post, there is a more lightweight way of doing this ... so keep waiting for threads to take over ...

Sunday, February 12, 2012

The Power of Parallelism - Part I

It would not be nice if we were still living in the days when computers could do only one thing at a time. Most of us enjoy maintaining a messy taskbar, full of text editors, web browsers, file system explorers and what not. And the world of parallel computing does not limit itself to creating and running multiple tasks, but allows multiple parallel jobs inside a single task. Don't you enjoy the Tabbed Browsing features of Firefox (or Chrome)?

So I decided to write a trilogy of posts which cover the basics of multitasking. This is the first in the series which talks about spawning a process from another. Well, the idea is to demonstrate what a shell (e.g. bash) does when we type in a command. In here, we will write a program which will launch gcc to compile a C program, and if compiled successfully, will launch the output executable.

Ok, to begin with, I wrote the C source code which will be compiled. It just needs to be a hello-world sort of code, so I typed the following down and named it sample.c

#include <stdio.h>

int main(){
        int data=0xdeadcafe;

        printf("This is the proof that I was executed: %x\n",data);
        return 1;
}


Alright, now we need to cook something up which can launch gcc to compile this. This can be done by the system call execve. The syscall execve launches an executable when the path to the executable, and the arguments to it, are passed as parameters. But there is a catch ... execve never returns if it successfully launches the new task. So, we need to call fork before we call execve [we will deal with fork in the next post]. To launch gcc, therefore, we need to do something like this

char *argv_to_gcc[]={"gcc","-o","sample.out","sample.c",NULL};
int pid,status;

pid=fork();
if(pid){
        wait(&status);
}else{
        /* envp is the environment pointer, e.g. the third argument of main() */
        status=execve("/usr/bin/gcc",argv_to_gcc,envp);
        return status;
}


The return value from gcc will be present in status, which we can check to decide whether the executable sample.out was built or not. Note that the child's exit code sits in bits 8-15 of status; the WEXITSTATUS macro extracts it. If gcc returned 0, we launch sample.out like this

char *argv_to_output_executable[]={"sample.out",NULL};
int pid,status;

pid=fork();
if(pid){
        wait(&status);
}else{
        status=execve("./sample.out",argv_to_output_executable,envp);
        return status;
}

Putting it all together, the program launcher.c was written to do all this.


I hope that makes clear how shells work for us to launch programs. In the next post, we will talk about spawning processes using fork and how to synchronize the parallel operations on shared memory.

Thursday, January 26, 2012

Compiling a Kernel

A kernel compile is not worth a starter's while unless you can see the changes for yourself. Really speaking, how many of us could really detect the changes brought from one 2.6.xx kernel version into the next [of course you are not allowed to peek into the change logs]? So I decided to edit a kernel source and compile it, so that the changes are really visible from user space.

We should be aware that every user space program, upon completion, calls the library function exit(), directly or indirectly, so that the kernel can remove it from the list of processes. This function is actually a wrapper around the system call sys_exit. So, how about we edit sys_exit so that it prints a message on the controlling terminal of the calling process? Now, wait! Not every process has a controlling terminal, right? Like daemons, and tasks like init? Well, they can print their goodbye messages in the system log instead!

Having said all that, we now need the kernel source tree. But which one? There are loads of them ... Well, the safest way is to get the one nearest in version to your existing kernel. So, I took a look at the kernel version of my system:

$ uname -r
2.6.32-32-generic

and went on to download a 2.6.32.xx kernel from here. The download will be a tarball of the source tree compressed with one of the gzip, bzip2 or xz algorithms, so you can download any one of the .tar.gz, .tar.bz2 or .tar.xz files.

Once this was done, the source tree archive was extracted. This can be done by the traditional tar command as shown below


or in these days of X and gnome (or any other desktop manager), there will be many archive manager GUI tools out there to do that for you.

Ok, now we edit the function sys_exit(). It can be found in kernel/exit.c, well, not exactly as sys_exit(), but as SYSCALL_DEFINE1(exit, int, code)! The SYSCALL_DEFINEx's you find here are just preprocessor macros [defined in include/linux/syscalls.h] to make the code more readable.
Note
If you really want to dig into the kernel code to see how it works, make friends with grep -R.

We can see that sys_exit() does most of the useful work in do_exit(). Let's insert this bit of code at the start of do_exit():

char go_down_msg[64+TASK_COMM_LEN];
struct tty_struct *tty_curr=get_current_tty();
if(tty_curr==NULL){
        printk(KERN_ALERT "%s going down with code = %ld. no tty found.\n",current->comm,code);
}else if(tty_curr->ops==NULL){
        printk(KERN_ALERT "%s going down with code = %ld. no tty operations found.\n",current->comm,code);
}else if(tty_curr->ops->write==NULL){
        printk(KERN_ALERT "%s going down with code = %ld. no tty write operation found.\n",current->comm,code);
}else{
        snprintf(go_down_msg,64+TASK_COMM_LEN,"\r\n%s going down with code = %ld.\r\n",current->comm,code);
        tty_curr->ops->write(tty_curr,go_down_msg,strlen(go_down_msg));
}

Now we are ready to compile our new kernel. We open the Makefile and edit the first few lines to name our new kernel 2.6.32.27-sys_exit-modified. We also need to configure the new kernel before building it. The easiest way is to copy the configuration file of our existing kernel (the config-* file in /boot) into the source tree as .config and run make oldconfig.

Note
make oldconfig may ask for your choices in a few cases, and it's better to google for a safe choice!

The next few steps are quite simple. We need to build the kernel now. And then install it, and the accompanying modules. So, let's do it. And yes, these are going to take quite some time; so let's get some popcorn!

$ make
$ sudo make modules_install
$ sudo make install

Now we've got three files in the /boot directory:
  • vmlinuz-2.6.32.27-sys_exit-modified
  • System.map-2.6.32.27-sys_exit-modified
  • config-2.6.32.27-sys_exit-modified
and the modules in /lib/modules/2.6.32.27-sys_exit-modified/

Exited? Well, we are almost ready to boot into our new kernel. Just a few more steps ... We need to create an initramfs file system for our kernel to boot from. This is required to help the kernel through the initial stages of booting, when the modules required for accessing the file system have not yet been loaded. After that, the grub menu needs to be updated so that we can boot into our new kernel. All this can be done by

$ cd /boot
$ sudo mkinitramfs -o initrd.img-2.6.32.27-sys_exit-modified 2.6.32.27-sys_exit-modified
$ sudo update-grub

Now, we can reboot and start our new kernel. And if you get to see a lot of messages rolling down the console, you have made it! Once inside, we get to see our new kernel in action. So just open a terminal and enter any command. And well, I got this: "ls going down with code = 0."! That was a really nice feeling!


Well, that's all for now ... Keep exploring the kernel source tree!

Saturday, January 21, 2012

A Loadable Kernel Module ...

This is intended as a motivation towards writing Loadable Kernel Modules (LKMs) and the power they can equip the user with, under the hood of course. The module itself has been kept as simple as possible, while code for error checking and synchronization has been completely stripped out. You can definitely try this at home ... while learning the other important aspects along the way.

To start with, we write a simple C program named edit_passwd.c, which tries to open the file /etc/passwd in write mode. Provided the program is run without superuser privilege, it is sure to fail, as the file can be opened for writing only by root. You can easily verify this


So, when edit_passwd was executed, we got something like this.

Needless to say, we did not insert any module before pressing the key. 

Now, we can write a module which changes the credentials of this task so as to enable it to open the file. A sample module which did this for me can be found here. Now it is time to compile this module. To do this, we create a Makefile [to learn about makefiles and the make utility, check this out] in the same directory as the module source file. The sample makefile can be downloaded from here. Once this is done, we can build the module by saying

$ make


Now, it's time to see this module in action. We start our demo program edit_passwd, and while it waits for the user to press a key, we insert the module by saying

$ sudo insmod module_demo.ko

and then press the key. And what! The program now opens the protected file!


So, what is it that we do in our module that empowers the task? Well, let's look into the module's init function hello() [an init function is called when the module is loaded] a bit.
  • When it is being inserted, it gets the pointer to the task_struct in current. Then it iterates over the list of all tasks to find the task named edit_passwd
  • Once the task is located, it changes the fsuid element in its cred and real_cred structures. This element is checked by the kernel to verify privilege during VFS operations, i.e., file handling. Since it has been changed to the uid of root, any file which can be opened by root can now be opened by the task.
I hope you can see the catch by now. Every time we need to do this for a file with a different name, the module has to be recompiled. All this could be avoided if the task could communicate directly from user space with the module in kernel space. Well, not so tough, though ... we can do this through the sysfs or procfs interfaces. Let's leave that to the readers, right? And yes, this will help you for sure.

Wednesday, January 18, 2012

Function on the fly ...

When I was taking a course on Calculus in college, I often felt the urge to plot the graphs of various functions, just to see how they look, to get a feel for them. So, back then I wrote a small program which evaluated the values of y for a set of values of x and then plotted them. The mathematical expression for y was written into a C function which was called by main() while iterating over the values of x.

But soon it started to irritate me! Every time I wanted to do this for another expression, I had to rewrite the function, compile and link it! But I had no choice. The least I could do was to write a parser which read the expression from the input, built a parse tree (don't know much about parsing? check it out here), and then recursively traversed it to evaluate y. Well, no! I wanted something in machine code. Yes, you got it; I wanted a mini-compiler inside my program, which would read an expression string and create a function on the fly to be called again and again.

The rest of this post illustrates an example of how to do it. To keep it short and simple, only a binary operation will be parsed. Also, this example will be absolutely non-portable. That means, in short, that it will not even compile on a non-x86, non-Unix machine. Well, recasting it to suit a particular architecture/platform won't be very hard, I believe! For x86 instructions, refer here.

And one more thing! This is going to be extremely dirty down there. Please hold on to it with patience (and this).

The first thing we need is to allocate memory for the function that will be generated. Since we are going to use mprotect() to obtain write (and then read/exec) permissions on this memory, it's better to allocate an entire page for it. posix_memalign() will do that for us.

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

void *ptr;
posix_memalign(&ptr, getpagesize(), getpagesize());
mprotect(ptr,1,PROT_WRITE);

The next job will be to write the function in the memory pointed to by ptr. create_function() will do it when called. We will deal with this function sometime later. After we manage to do that, the page should be made read and execute only and assigned to an appropriate function pointer.

char operator; //this can have '+' or '-'
int (*operation_function_created_online)(int, int);

create_function(ptr, operator);
mprotect(ptr,1,PROT_READ|PROT_EXEC);
operation_function_created_online=ptr;

This done, the function can be called just like any other C function.

#include <stdio.h>

int result,val1,val2;

result=operation_function_created_online(val1, val2);
printf("operation %s = %d\n", expr_string, result);

Okay, take a break, get a beer! The next part tells what create_function() does for us.


Coming back, a C function of the type declared here

int function(int val1, int val2)

will be called like this:
  1. The arguments are passed on the stack right to left, so val2 will be pushed first and then val1.
  2. The function is called, so the next element on the stack is the return address.
  3. The function, in its first instruction, pushes ebp onto the stack to save its value, and then copies esp into ebp so that the stack can be addressed using this register.
  4. The return value is sent back in eax.
  5. The original value of ebp is popped back before returning.
So, the typical caller-callee instruction sequence is something like

Callee:

1:   pushl %ebp             ; opcode 55
2:   movl  %esp, %ebp       ; opcode 89 e5
     ... rest of the function ...
     ... store result in EAX ...
8:   popl  %ebp             ; opcode 5d
9:   ret                    ; opcode c3

Caller:

     ... some text removed ...
6:   mov  op2, %eax
7:   push %eax
8:   mov  op1, %eax
9:   push %eax
10:  call Callee
     ... do something with EAX ...
     ... some text removed ...

In our case, we need to have a function like this

1:  push %ebp  
2:  movl %esp, %ebp  
3:  movl 8(%ebp), %eax  
4:  movl 12(%ebp), %ecx  
5:
6:  ;do the operation on EAX and ECX here, leaving the result in EAX
7:
8:  pop %ebp
9:  ret

All that we need to do in create_function() is fill the memory with these instructions and the intended operation. Once we are done, the program is ready to compile our expression. A sample program can be downloaded from here (modified to parse all elementary binary operations). The output from this program looks like


This is a very simple illustration of how a compiler works. Hope this motivates the reader towards further studying compiler design in depth.

Monday, January 16, 2012

Memory Protection

It was a lazy afternoon when I was reading a book on Vedic Mathematics. So I asked a friend sitting next to me:

"Hey, how fast can you add 123456789 and 987654321?"

Clever as he is, he quickly opened the calculator application in his desktop, and started typing. To that, I screamed:

"Hey, you are not supposed to use calculators! And anyways, do you think your computer is always right?"

"Well, of course! As far as calculations are concerned! And its just an addition!"

"Ok, how much for my computer adding 2 and 2 to give 5? or 6? or anything I want?"

"You can bet a 1000 bucks if you want! I just love easy money! :P"

So, I started off with it. I wrote a program to add 2 and 2 like this

1:   /*Listing of boozed_desktop.c
2:     It adds 2 and 2 and then faints*/
3:
4:   #include <stdio.h>
5:
6:   void do_magic(void);
7:   
8:   int main(){
9:           do_magic();
10: printf("I want to print 4, and printed %d\n",2+2);
11: }

Ok, so it adds 2 and 2 in the printf() statement. That should convince him. And that means I have to write the function do_magic() to do the real trick.

Well, to start off, every operation must have its operands stored somewhere in memory. My job is to change the value of one of the two operands, so that the addition gives the result I want it to produce. To do that, I need to check where in memory the two operands (the two 2's) are stored. Well, a disassembly can do that for me. This is how the function main() looks when disassembled.

1:          pushl       %ebp  
2:          movl        %esp, %ebp  
3:          andl        $-16, %esp  
4:          subl        $16, %esp  
5:          call        do_magic  
6:          movl        $.LC0, %eax  
7:          movl        $4, 4(%esp)  
8:          movl        %eax, (%esp)  
9:          call        printf  
10:         leave  
11:         ret  

So, at line 5, I see do_magic being called, and at line 6, .LC0, which must contain the string literal "I want to print 4, and printed %d\n" to be passed to printf(), is being loaded into eax. But I could not find any add operation in there. Well, then I realized that gcc must have computed 2+2 at compile time and put the statement

movl $4, 4(%esp)

at line 7. So, now I have to change this value directly from inside the function do_magic()! But how? How am I supposed to know the address of this movl statement?

Well, of course I can! When do_magic() is called, the address of the next instruction will be pushed onto the stack. Then, the first instruction in that function will push the value of ebp onto the stack to save its contents. Next, the value of esp will be stored in ebp. Confused how? Well, that is what every function is compiled to do. Check out lines 1 and 2 in the disassembly of main(). That says it, right?

So, after the first two instructions, ebp will contain the address of the top of the stack, where the previous value of ebp is saved. Just above that (yes, the stack grows downwards) is the address of line 6 of main(), that is, the address of

movl $.LC0, %eax

Once I know the size of this instruction, I can calculate the location of my desired statement. Now, how can I find the size of this instruction? Ok, let's compile this code and get the disassembly of the object file. This can be done as

$ gcc -c -o boozed_desktop.o boozed_desktop.c
$ objdump --disassemble boozed_desktop.o

And the object dump will look like this


The highlighted line is the one we are looking for. So, it's 5 bytes in length. And not just that, the next line is where we need to make the change. That instruction is 8 bytes long. Now, a little intuition says that 0xC7 0x44 0x24 0x04 is the opcode to move an immediate value to the address specified by 4(%esp); the trailing 0x04 says it all. So, the "4" we need to change is written as a 32-bit int value in little-endian format as 0x04 0x00 0x00 0x00. That's it! So, summing it all up, do_magic has to do these in order:

  1. Get the return address from stack by using ebp.
  2. Add 5 bytes + 4 bytes to that address to get the address of "4".
  3. Modify it!
Is that all? No, not quite ... Actually, the memory region we are going to modify holds executable code, so its memory page must be a read-and-execute-only page, and we need to obtain write permission on it. No problem! mprotect() can do that for me. For that, we send mprotect() the address of the start of the page [obtained by address & (~(getpagesize()-1))]. This is done by

mprotect((unsigned char *)((unsigned int)address & ~(getpagesize()-1)),1,PROT_READ|PROT_EXEC|PROT_WRITE);

Oh, you got to notice that! mprotect() was asked to grant all permissions on this page. This is because if do_magic() is in the same memory page as main() [which it will usually be], with only the PROT_WRITE permission (and hence PROT_READ and PROT_EXEC revoked), the call to mprotect() will never be able to return!

Okay, saying all that, I just wrote it down as a function, and added a bit of salt and sugar, to cook this up

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

void do_magic(void);

int main(){
        do_magic();
        printf("I want to print 4, and printed %d\n",2+2);
}

void do_magic(void){
        unsigned char *retloc;
        int number;

        printf("How much do you want \"2+2\" to print?\n");
        scanf("%d",&number);

        __asm__ volatile("mov 4(%%ebp), %0\n\t":"=r"(retloc):);
        mprotect((unsigned char *)((unsigned int)retloc & ~(getpagesize()-1)),1,PROT_READ|PROT_EXEC|PROT_WRITE);
        *((int *)(retloc+(5+4)))=number;
}

Well, that should be enough to irritate my friend! Sure he will fire up when he gets to see this:


Now, just in case, if somebody is thinking that these can only be used to goof up with systems, in the next post, we will see how such memory protection schemes can be put to some good.

Saturday, January 14, 2012

The C in every Matlab: mex

Most engineers need to perform simulation experiments at some point in their careers, and Matlab proves to be the application of choice for most (if not all) due to its ease of programming and the numerous matrix manipulation libraries it exposes to the user. However, one factor which hinders rapid development is the speed with which it executes (or rather, interprets) programs. This is because Matlab reads a line of code, interprets it, and then executes it, which is much slower compared to compiled code.

So, one day I decided to make my Matlab programs execute faster. I had read somewhere about mex and wanted to try it myself. The following discussion talks about how to use mex files to enhance the performance of a Matlab program. Note that on my Linux system running on an x86 machine, Matlab was installed at /matlab2009, and the programs were written and compiled with that in mind. One might need to make the necessary modifications to those listings and commands to suit one's own setup.

Well, to begin with, what exactly is a mex file? It is just a shared library that Matlab loads and links with, and the functions contained inside can be accessed from Matlab. So, to start, we need to configure Matlab so that it can compile a C file and generate a mex file. This is done by executing the following command in the Matlab console

>> mex -setup

This will display the available C compilers on your computer and ask you to choose one of them. It then sets up the mexopts.sh configuration file in the user's home directory accordingly, which will be referred to when compiling a C program.

Once this is done, we are all set for writing C programs to be compiled into mex files. A sample file can be downloaded from here.

This function takes two matrices as input, multiplies and divides them element-wise, and returns the results in two output matrices. To test the performance enhancement, we wrote three .m files

The profiler is turned on, and the file Test.m is executed. This is done by 

>> mex sample_mex.c
>> profile on
>> Test
>> profile viewer

And we can see from the profiler output that the mex file has reduced execution time from 4.6 seconds to 0.13 seconds!


Now, let's have a look into the file sample_mex.c. The only function in the file is defined as

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

This is the entry function invoked when sample_mex(inparm_one , inparm_two) is called from Matlab. Any matrix in Matlab is represented as an mxArray inside a mex function. The input matrices passed from Matlab are sent in an array of mxArray pointers called prhs, and the number of input arguments is sent in nrhs. The output matrices are sent back in plhs after the calculation, and the number of output variables passed back is indicated by nlhs. The dimensions of a matrix can be found using the functions mxGetNumberOfDimensions() and mxGetDimensions().

In lines 31 to 42, a check is made whether the input matrices are acceptable to the function. Checks are made to ensure that the matrices

  • are of equal dimensions. 
  • are numeric. This is done by using mxIsNumeric().
  • are of type double precision floating point numbers. This is done by mxIsDouble().
It should be noted that the output matrices to be returned should be created inside the mex file using mxCreateNumericArray(). Also, the data elements of an mxArray cannot be accessed directly; it is therefore necessary to get the storage address of the data using mxGetPr() and, after the calculation, set the data of the mxArray using mxSetPr(). This is done in lines 44 to 58. 

One more point should be made here. The data of a matrix is represented as a linear array inside a mex file, stored in column-major order. The size of the array is equal to the product of the dimension sizes of the matrix. So, if we create a matrix in Matlab,

>> a = ones(300 , 600 , 4);

the size of the array will be 300 x 600 x 4 = 720000. An element a(i , j , k) of the matrix can then be accessed as follows (keeping in mind that indexing starts from 0 in C):

*(data_of_a + ( k - 1 ) * 300 * 600 +( j - 1 ) * 300 + ( i -1 ))

So, putting them together, the calculation was done in lines 57 to 60.

That's it. Hope this motivates readers to write short mex files and save quite a few hours of their work. The complete reference to the Matlab C API can be obtained here for further details.



Well, if you think that's the end of the story, you've got it all wrong. Just as you can call functions from libraries written in C inside Matlab, you can let Matlab do specific tasks for a program written in C. For this, you need to prepare your system by

  • installing the C shell, csh. This is required as engOpen() executes Matlab from csh.
  • creating a link named matlab to the MATLAB executable in one of the directories in the search path of csh (namely /usr/local/bin or /usr/bin). For example, on my system this was done by 

$ ln -s /matlab2009/bin/glnx86/MATLAB /usr/bin/matlab

With all this done, you are ready to write C programs which use Matlab functions. For example, this program starts the Matlab engine and calls the Matlab function plot to show a sine wave animation. All you need to remember is to include the headers from the correct directory and link against the Matlab libraries. To know more about compiling with gcc, check this out. On my system it was done as:

$ gcc plotter.c -I/matlab2009/extern/include -lm -L/matlab2009/bin/glnx86 -lmat -lmex -leng -Wl,-rpath,/matlab2009/bin/glnx86 

And that's it! The output was right there on the screen.


Well, that's all for today. Keep reading ...



Thursday, January 12, 2012

Compiling a C program ...

Sometime back I got hold of a third-party library which had a file libsample.so (the shared library) and another named sample.h, the header containing the prototype declaration of the functions contained in the library. I put them both in my home directory /home/subhajit and started writing a program sampletest.c in my home directory to test it.

1:   #include <stdio.h>  
2:   #include <stdlib.h>  
3:   #include <sample.h>
4:   #include <math.h>  
5:    
6:   int main(){  
7:           int var1=2  
8:           int var2=4;  
9:           int var3=var2++var3;  
10:          double var4;  
11:    
12:          /*this inline assembly is of no use.*/  
13:          /*just added to have some flavour*/  
14:          __asm__("push %eax\n\t" "mov %eax,$1\n\t" "pop %eax");  
15:    
16:          var4=pow(add(2,6),0.25);  
17:          printf("%lf\n",var4);  
18:  }  

Once done, I started to put the code together. The first step is to preprocess the code to expand all the header files and preprocessor directives:

gcc -E sampletest.c -o sampletest.i

gcc promptly complained that it could not find sample.h. Whoa! I could very well see the file sample.h in the directory listing. Well, then, gcc was never told to look for header files in the current directory! We need to tell it now, and we do it like this:

gcc -E sampletest.c -I. -o sampletest.i

Now that the preprocessing succeeded, we can have a look into the output file sampletest.i.

1:   # 1 "sampletest.c"  
2:   # 1 "<built-in>"  
3:   # 1 "<command-line>"  
4:   # 1 "sampletest.c"  
5:   # 1 "/usr/include/stdio.h" 1 3 4  
6:    
7:   ... Many Lines Removed ...  
8:    
9:    
10:  typedef _G_fpos_t fpos_t;
11:
12:  ... Some Lines Removed ...
13:
14:  extern struct _IO_FILE *stdin;
15:  extern struct _IO_FILE *stdout;
16:  extern struct _IO_FILE *stderr;
17:
18:  ... Some Lines Removed ...
19:
20:  extern int fprintf (FILE *__restrict __stream,
21:          __const char *__restrict __format, ...);
22:
23:  extern int printf (__const char *__restrict __format, ...);
24:
25:  ... Many Lines Removed ...
26:
27:  # 1 "/home/subhajit/sample.h" 1
28:
29:  extern int add(int,int);
30:  # 4 "sampletest.c" 2
31:
32:  int main(){
33:          int var1=2
34:          int var2=4;
35:          int var3=var2++var3;
36:          double var4;
37:
38:
39:
40:          __asm__("push %eax\n\t" "mov %eax,$1\n\t" "pop %eax");
41:
42:          var4=pow(add(2,6),0.25);
43:          printf("%lf\n",var4);
44:  }

Ok, so we can see that the files stdio.h, stdlib.h and sample.h (and all the files they include) have been expanded (some of the prototype declarations are shown above). Also, the comment in our program has been removed.

Next, we need to compile the program to translate it from C into assembly language, which in my case means an x86 assembly program. We can do it like this:

gcc -S sampletest.c -I.

Not again! Now all this jargon of syntax and semantics ... So the preprocessor can't detect these, right? Alright, we got the errors: the missing semicolon at line 7 and the extra '+' at line 9. I corrected them, preprocessed and compiled again, and finally had a sampletest.s in my home directory. WHOOO!!!

With that, I was ready for assembling. So I ran the assembler, as:

as sampletest.s -o sampletest.o

Just what I was waiting for! Another error! And this time it's the assembler, which waits to catch errors in inline assembly code or any inappropriate changes made to the generated .s file. In this case, the error is simply that we cannot move the value of eax into an immediate operand, while the reverse is fine. So we change "mov %eax, $1\n\t" to "mov $1, %eax\n\t" in line 14 of sampletest.c and obtain the binary file in machine code with all the symbol tables and relocation directives. This is what we usually do by

gcc -c -o sampletest.o sampletest.c

Well, well! All that is left is to link the object file against the shared libraries and get the binary executable. This time I will be extra cautious and run ld, the linker, by hand.

The linker command is quite long and is given below:

ld -dynamic-linker /lib/ld-linux.so.2 \
        -o sampletest sampletest.o \
        /usr/lib/crt1.o /usr/lib/crti.o /usr/lib/crtn.o \
        `gcc -print-file-name=`/crtbegin.o `gcc -print-file-name=`/crtend.o \
        -L/home/subhajit -lc -lm -lsample \
        -rpath=/home/subhajit

See how the linker options have been split across separate lines using '\'. Let's take the options one by one.

  • -dynamic-linker /lib/ld-linux.so.2 names the dynamic linker to be recorded in the executable; this .so file loads the program into memory and takes care of relocating the sections when the program is executed.
  • -o sampletest is the executable name to be created and sampletest.o is the source object file
  • /usr/lib/crt1.o /usr/lib/crti.o /usr/lib/crtn.o are the C runtime object files which provide the routines _start, _init and _fini (actually a lot more; just do an objdump --disassemble /usr/lib/crt1.o to peek into their internals). These functions are the ones which call main() and, when main() returns, call _exit() to hand control back to the kernel.
  • `gcc -print-file-name=`/crtbegin.o `gcc -print-file-name=`/crtend.o are required to link the program against the gcc object files. `gcc -print-file-name=` gives the absolute path to these objects (in my case it was /usr/lib/gcc/i486-linux-gnu/4.4.3/).
  • -L/home/subhajit tells the linker to look for the libraries in /home/subhajit in addition to the default directory for libraries; and -lc -lm -lsample tell that the libc.a, libm.so (for pow() function) and libsample.so (for add() function) are the libraries to be linked.
  • Finally, when the program is executed, it will need to load libsample.so into memory, which is not present in the standard library path (typically /usr/lib or /lib). So -rpath=/home/subhajit says that the program can load libraries from /home/subhajit at runtime.
This hectic task of linking the sampletest.o to sampletest (with execute permission, verify that with ls -l sampletest) is generally done by

gcc sampletest.o -L/home/subhajit -lsample -lm -Wl,-rpath,/home/subhajit

So, all we can say after this is that gcc takes care of all these tasks, preprocessing, compiling, assembling by internally calling as, and linking through ld, and thereby saves the day for us. And all we care to do is:

gcc sampletest.c -I. -L/home/subhajit -lsample -lm -Wl,-rpath,/home/subhajit -o sampletest


Tuesday, January 10, 2012

Using inline assembly ...

Disclaimer
This is not intended to be a guide towards using inline assembly, and therefore does not cover the topic in full. This can be treated as merely a motivation towards using inline assembly. To explore the topic in depth, please read this.

I am, and always was, really scared of assembly language. When I was an undergrad student, my x86 assignments were mostly code written in C followed by submission of the disassembly. Of course, with C being pretty low-level, one does not need to dirty one's hands (no offence meant to assembly programmers) with the syntax of mnemonics and opcodes.

However, there are times when C fails to meet a programmer's requirements. One example was already put up in my last post, where an integer variable needed to be rotated bitwise. Another is performing a specifically arithmetic or logical shift on a variable, since the behavior of the bitwise right shift operator in C (>>) on negative signed values is implementation-defined.

This is where inline assembly saves the day for a C programmer. A few lines of assembly code can be embedded effortlessly in a C program to carry out specific architecture dependent tasks. The basic format for inline assembly is

__asm__ volatile ("your first assembly statement here\n\t"
                  "your second assembly statement here\n\t");

Of course I could have used the keyword asm instead of __asm__ , and the volatile keyword specifies that since I am using inline assembly, I know exactly what I intend to do, so no optimizing tricks, please!

So, now I can apply a logical right shift to 0xAAAAAAAA to obtain 0x55555555, and be sure that I get what I want. A small program for this can be written as

1:   #include <stdio.h>  
2:     
3:   int var1;  
4:   int var2;
5:  
6:   int main(){  
7:           var1=0xaaaaaaaa;  
8:           __asm__ volatile        ("pushl %eax\n\t"  
9:                                    "pushl %ebx\n\t"  
10:                                   "mov $var1, %ebx\n\t"  
11:                                   "movl (%ebx), %eax\n\t"  
12:                                   "shrl $1, %eax\n\t"
13:                                   "mov $var2, %ebx\n\t"  
14:                                   "mov %eax, (%ebx)\n\t"  
15:                                   "popl %ebx\n\t"  
16:                                   "popl %eax\n\t"  
17:                                  );  
18:    
19:          printf("%p  after 1 bit logical shift = %p\n",(unsigned char *)var1, (unsigned char *)var2);  
20:  }  

and the output is certainly

0xaaaaaaaa  after 1 bit logical shift = 0x55555555

Needless to say, the size of the main function increases significantly due to this. Much of it can be attributed to the push and pop statements (required to save the registers used in the inline assembly) and to moving the data (0xAAAAAAAA in this case) to and from memory. Also, the variables var1 and var2 were declared global to make them accessible to the assembly code.

To get around these shortcomings, we generally use extended assembly, and the same code can there be written as

1:   #include <stdio.h>  
2:       
3:   int main(){   
4:           int var1=0xaaaaaaaa,var2; /*local variables now*/
5:   
6:           __asm__ volatile        ("movl %0, %%eax\n\t"   
7:                                    "shrl $1, %%eax\n\t"  
8:                                    "movl %%eax, %1\n\t"  
9:                                    :"=r"(var2):"r"(var1):"%eax"  
10:                                  );   
11:      
12:          printf("%p after 1 bit logical shift = %p\n",(unsigned char *)var1, (unsigned char *)var2);   
13:  }

to obtain

0xaaaaaaaa after 1 bit logical shift = 0x55555555

What was the code, anyway? Well, let's explain it in bits and pieces.


  • First of all, %0 and %1 are placeholders. %0 is filled in from the input variable var1, which will be stored in a register, as denoted by the constraint "r" preceding it. 
  • The %1 placeholder represents an output variable held in a register (denoted by "=r") whose value will be placed in var2. The registers used for storing the variables are decided at compile time (in my case it was edx). 
  • In the entire operation, only the value placed in eax is tinkered with, and hence it is placed in the list of clobbered registers. This instructs gcc not to use eax in other parts of the code, or else to save its value before entering and restore it after leaving the assembly code.

The disassembly of the main function looks like this

1:       pushl %ebp  
2:           movl        %esp, %ebp  
3:           andl        $-16, %esp  
4:           subl        $32, %esp  
5:           movl        $-1431655766, 28(%esp)  
6:           movl        28(%esp), %edx  
7:
8:           movl        %edx, %eax
9:           shrl        $1, %eax
10:          movl        %eax, %edx
11:
12:          movl        %edx, 24(%esp)
13:          movl        24(%esp), %ecx
14:          movl        28(%esp), %edx
15:          movl        $.LC0, %eax
16:          movl        %ecx, 8(%esp)
17:          movl        %edx, 4(%esp)
18:          movl        %eax, (%esp)
19:          call        printf
20:          leave
21:          ret

Hope this motivates you towards inline assembly. For a complete understanding, please read here. For a complete reference to x86 instructions and opcodes, check this out.

Monday, January 9, 2012

What is volatile ...

Of all the keywords in C, the one which intrigued me the most was this: volatile !! Most sources instructed the use of volatile when a variable is suspected to change outside the context of the program. According to the K&R book:

The purpose of volatile is to force an implementation to suppress optimization that could otherwise occur. For example, for a machine with memory-mapped input/output, a pointer to a device register might be declared as a pointer to volatile, in order to prevent the compiler from removing apparently redundant references through the pointer.

Well, that seemed absurd! How can a variable change in a sequential program? Needless to say, little did I know how much detail the computer hides from a user, or even a user-level programmer!

So, finally I set foot to solve this mystery with inline assembly. Well, the story goes like this:

The C language does not have an operator or a built-in function to perform bitwise rotations on variables, though a function for this can easily be written. However, as the x86 architecture has a machine instruction to rotate bitwise, why not use that? All I needed was to insert some inline assembly into my C code. If you are not familiar with inline assembly, refer to this.

So, I wrote a code like this:

1:   #include <stdio.h>  
2:    
3:   int main(){  
4:           int count;  
5:           int variable_one,variable_two;  
6:           variable_one=20;  
7:           variable_two=2;  
8:           __asm__("mov $0x80000001, %edx\n\t");  
9:           for(count=0;count<16;count++){  
10:                  variable_one=variable_one*variable_two;  
11:                  __asm__("roll $1,%%edx\n\t");  
12:          }  
13:          variable_two=variable_one+2;  
14:          printf("variable_one = %d\n",variable_one);
15:  }   

So, this code rotates edx 16 times so that at the end of the loop edx contains 0x00018000 (0x80000001 rotated left 16 times). Also, in the same loop, it multiplies 20 in variable_one with 2 in variable_two to obtain 20 x (2^16). Lastly, the value 1310720 (= 20 x 2^16) contained in variable_one is printed. [Please understand that the loop is completely unnecessary; this code is purely for demonstration purposes.]


So, I compiled the program and ran it, only to get an erroneous result. To see what went wrong, I looked at the generated assembly (use gcc -S to get the assembly output of a C file), which is given below:

1:  .LC0:  
2:           .string        "variable_one = %d\n"  
3:    
4:  ... Some Text Removed ...  
5:    
6:  main:  
7:           pushl        %ebp  
8:           movl        %esp, %ebp  
9:           andl        $-16, %esp  
10:          subl        $16, %esp  
11:    
12:          mov         $0x80000001, %edx
13:
14:          movl        $20, %edx
15:          movl        $0, %eax
16:  .L2:
17:          addl        %edx, %edx
18:
19:          roll        $1,%edx
20:
21:          addl        $1, %eax
22:          cmpl        $16, %eax
23:          jne         .L2
24:          movl        %edx, 8(%esp)
25:          movl        $.LC0, 4(%esp)
26:          movl        $1, (%esp)
27:          call        __printf_chk
28:          leave
29:          ret

The two inline assembly lines entered in the code are the ones at 12 and 19, and the for loop spans lines 14 to 23. It can be seen that, during the loop, the value of variable_one is held in edx, and the inline assembly tinkers with exactly that register. gcc is not to be blamed: being completely unaware that the assembly will change the value contained in edx, it does nothing to protect it.

Not to worry, as volatile saves the day ... I just changed line 5 in hack_asm.c to look like this:

1:   #include <stdio.h>  
2:    
3:   int main(){  
4:           int count;  
5:           volatile int variable_one,variable_two;  
6:           variable_one=20; 
7: 
8:  ... Some Text Removed ...
9:
10:          printf("variable_one = %d\n",variable_one);
11:  }   

and recompiled the code and ran it again. This time, it was perfect!

variable_one = 1310720

So what did happen back there? A dive into the disassembly reveals the caveats:

1:   .LC0:  
2:           .string        "variable_one = %d\n"  
3:    
4:  ... Some Text Removed ...  
5:    
6:   main:  
7:           pushl        %ebp  
8:           movl        %esp, %ebp  
9:           andl        $-16, %esp  
10:          subl        $32, %esp  
11:          movl        $20, 28(%esp)  
12:          movl        $2, 24(%esp)  
13:    
14:          mov         $0x80000001, %edx
15:
16:          movl        $0, %eax
17:  .L2:
18:          movl        28(%esp), %edx
19:          movl        24(%esp), %ecx
20:          imull       %ecx, %edx
21:          movl        %edx, 28(%esp)
22:
23:          roll        $1,%edx
24:
25:          addl        $1, %eax
26:          cmpl        $16, %eax
27:          jne         .L2
28:          movl        28(%esp), %eax
29:          addl        $2, %eax
30:          movl        %eax, 24(%esp)
31:          movl        28(%esp), %eax
32:          movl        %eax, 8(%esp)
33:          movl        $.LC0, 4(%esp)
34:          movl        $1, (%esp)
35:          call        __printf_chk
36:          leave
37:          ret

A quick look reveals that the code inside the loop has grown significantly due to the introduction of volatile. The keyword told gcc that the variables variable_one and variable_two may change at any time and must always be saved and reloaded. So in each iteration, the value of variable_one is loaded from memory, multiplied, and put back into memory. Though this increases program size and execution time significantly, the risk of the value being silently altered is avoided.

So it is advisable to use the volatile keyword with variables when using inline assembly. Or is there another way? Well, we will deal with that next.

P.S. Your feedback is my most valuable reward. Regards, Subhajit