Thursday, February 16, 2012

The Power of Parallelism - Part III

In the last post we talked about multiprocessing, where a parent process forked children to work in parallel. However, creating multiple processes to share the load of a common task is often not worth the effort, as switching between processes is an expensive operation for the operating system. You may someday dive into the kernel source to see all the work the scheduler has to do while switching between processes.

In order to make parallel computing more efficient, lightweight processes called threads can be created within the scope of a process. Since threads are multiple workers running in parallel within a process, they share its logical address space, heap, memory maps, and even its open files. Thus, to switch between threads, only the stacks and the respective instruction pointers need to be swapped, which involves far less overhead than switching between separate processes.

However, it should be kept in mind that threads run in parallel and independently of each other, working over shared global data, and hence synchronizing their operations is just as important. Just as multiple processes use semaphores, threads can, too! And what's more fun is that the semaphores don't need to live in shared memory, as all threads can access global semaphores directly.

To work an example out, let's solve the problem stated in the last post using threads. To create a thread, we just need to call

pthread_create(&newThread, &threadAttribute, operatingFunction, argumentToFunction);

Here, threadAttribute is a pthread_attr_t variable which decides the behavior of the newly created thread. Now, how do we tell the thread what to do after it is born? Well, not much; it just has to execute the code in the function named operatingFunction, which takes a void * argument and returns a void * value. Because the parameter and the return value are both void pointers, any type of data can be sent to and received from the thread. Here argumentToFunction is the argument we send to the thread. Once the thread is created, its ID is returned in newThread so that it can be controlled and monitored. More references to thread-related functions can be found here.
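As a minimal sketch (worker and arg are just placeholder names, and error checks are omitted), a thread can be created and then waited upon like this:

#include <pthread.h>
#include <stdio.h>

void *worker(void *arg){
        /*this runs in the new thread; argument and return value are both void pointers*/
        printf("worker received %d\n", *(int *)arg);
        return NULL;
}

int main(){
        pthread_t newThread;
        int arg=42;

        pthread_create(&newThread, NULL, worker, &arg);  /*NULL attributes mean default behavior*/
        pthread_join(newThread, NULL);                   /*wait for the thread to finish*/
        return 0;
}

Compile with gcc -pthread. pthread_join blocks until the thread returns, much like wait does for processes.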

The worked-out solution to the problem discussed in the last post can be found here. One thing to note is that between successive attempts to lock the semaphores, the thread relinquishes the CPU by calling sched_yield. This stops it from continuously hogging the CPU while polling the lock, and thus gives the other threads a chance to finish their work and eventually release the locks. I would suggest you comment out the two lines containing sched_yield, compile and execute, and watch the drastic fall in performance.
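The idea looks roughly like this (a sketch; take_lock is just an illustrative helper around one of the global semaphores):

#include <semaphore.h>
#include <sched.h>

sem_t prot_one;                         /*a global semaphore, visible to every thread*/

void take_lock(void){
        /*sem_init(&prot_one, 0, 1) is assumed to have been done once at startup*/
        while(sem_trywait(&prot_one)!=0)
                sched_yield();          /*someone else holds it: give up the CPU instead of spinning*/
}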


I hope it is now clear how easy it is to use threads to improve the response time of programs. Here you can find a worked-out (and well-commented, as well) multi-threaded solution to counting occurrences of a specific byte in a file. The screenshot below might interest you to have a look into it ...


N.B. To all who had a gaze at the above program, it's pretty awkward to reserve memory to store the entire file. It's equally awkward to read such huge chunks of data in a single call to read. You might find it interesting to make the program a bit more savvy ... remember the synchronization issues, of course!

Well, that's all for this series of posts related to parallel processing. I hope they prove to be helpful. Keep reading for more ...

Wednesday, February 15, 2012

The Power of Parallelism - Part II

So, we have all dealt with processes at some time or the other. Even the browser we are reading this in is a process in itself. If we have a look at the output of the ps command, we can find multiple processes running in a system at any given time. We have learnt to write programs which compile into an executable and then become a process when launched. In the last post, we also talked about launching an executable file to create a new process.

In this post, we will see how we can create processes on the fly. In fact, it's quite an easy thing to do. Anywhere in your program, write this:

int pid;

pid=fork();
if(pid==-1){
        /*the system failed to spawn a new process*/
}else if(pid==0){
        /*this part is executed by the child*/
}else{
        /*this is the parent's job to do*/
}

fork is a system call; when called from a process, it dives into the kernel to create a copy of the calling process and returns to both of them. If the process creation fails, it can return only to the parent, and hence returns -1. If, however, the child is born successfully, there are now two processes to return to! So it returns 0 to the child, and the child's process ID to the parent.

Now, being independent of each other, the two processes can finish their jobs at any time and exit. But if the parent process really cares about the child's exit status, and must wait to clean up anything its children might have messed up, it can do so with a call to wait. If any child forked by the caller is still running, the system call wait puts the caller to sleep until that child exits. If no such child exists, the call returns -1 immediately.
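For instance, a parent that collects its child's exit code could look like this (a small sketch with minimal error handling):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(){
        int status;
        pid_t pid=fork();

        if(pid==0){
                /*child: pretend to do some work, then exit with code 7*/
                exit(7);
        }
        wait(&status);                          /*sleep until the child exits*/
        if(WIFEXITED(status))
                printf("child %d exited with code %d\n", (int)pid, WEXITSTATUS(status));
        return 0;
}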

A much-awaited query inevitably follows the discussion ... now that we have two processes, can they talk to each other? Well, of course they can! Inter-process communication (IPC, in short) saves the day. Two processes can share data using sockets, message queues or shared memory, and in this post we will deal with IPC using shared memory.

A process can create a shared memory segment using

int shmid = shmget(key_t key, size_t size, int shmflg);

which returns the ID of the newly created shared memory segment, or -1 if it fails. Such a shared memory segment can later be destroyed by

shmctl(shmid, IPC_RMID, NULL);

Once a shared memory segment is created, a process needs to attach to it to be able to use it. A call to shmat does this. Similarly, shmdt detaches the virtual memory area corresponding to the shared memory segment from the memory map of the calling process.
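Put together, the life cycle of a segment looks roughly like this (a sketch; the size and flags are arbitrary, and the return values should of course be checked):

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(){
        int shmid=shmget(IPC_PRIVATE, 4096, IPC_CREAT|0666);  /*create a 4 KB segment*/
        char *data=shmat(shmid, NULL, 0);                     /*attach it to our address space*/

        strcpy(data, "hello from shared memory");             /*any attached process can read this*/
        printf("%s\n", data);

        shmdt(data);                                          /*detach from our memory map*/
        shmctl(shmid, IPC_RMID, NULL);                        /*destroy the segment*/
        return 0;
}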

Having said all this, it's time to bring it all into action. We can think of a problem where there is a shared variable, and each process monitors it. They check its value and go to sleep. If they find the value unchanged after waking up, they stay happy. Otherwise, hard luck! Writing a program to do that should not be that tough, right? So here it goes ....


As is clearly seen, all the children can read and write the shared memory just as expected, and that leads to yet another problem ... synchronizing their activities. This is required because all of them are now working on common data, and it's quite possible for one to overwrite data that is being used by another.

So, how do we synchronize access to the variables? Nothing much, we add two semaphores to protect the two variables. Semaphores can be thought of as counters, which count down when locked (or waited upon); once the value reaches zero, the caller is blocked until someone else unlocks (or posts) to increment the value.
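Since the workers here are separate processes, the semaphore itself has to live inside the shared memory segment. A minimal sketch of the pattern (prot_one is one of the two semaphores used below; setup and touch_shared_variable are just illustrative helper names):

#include <semaphore.h>

sem_t *prot_one;                        /*will point into the shared memory segment*/

void setup(void *shared_area){
        prot_one=(sem_t *)shared_area;
        sem_init(prot_one, 1, 1);       /*second argument 1: shared between processes; initial value 1*/
}

void touch_shared_variable(void){
        sem_wait(prot_one);             /*lock: decrement, or sleep if the value is already zero*/
        /* ... read or modify the protected variable here ... */
        sem_post(prot_one);             /*unlock: increment, waking a waiter if there is one*/
}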

We therefore tune the code so that each process locks the semaphores first (well, with a small rest in between the two locks, the reason for which will become clear shortly) and then goes to sleep. Upon rechecking the value after waking up, the semaphores are unlocked, so that others can try their feat, too. However, which semaphore to lock first is decided at random. The modified code here was tried out a few times


What happened now? Both the children were happy in the first case, while in the second case, when 10 children were created, they just hung! We had to kill the system!

This is because, owing to the randomness in the order of locking and the sleep in between, a situation might have occurred where one child acquired the prot_one semaphore and requested prot_two, while another held prot_two while asking for prot_one! A deadlock!

Well, the writers of the real-time libraries must have thought of this in advance, and so we do have a workaround ... sem_trywait! It is not as aggressive as sem_wait: instead of blocking when the semaphore is already held by someone else, it returns immediately with an error. Using this trick, we rewrite the code once again
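The heart of the rewrite looks roughly like this (a sketch; lock_both is an illustrative helper name, and the random choice of which semaphore to try first is left out for brevity). Giving the first semaphore back when the second cannot be taken is exactly what breaks the deadlock:

#include <semaphore.h>

sem_t *prot_one, *prot_two;                     /*both live inside the shared memory segment*/

void lock_both(void){
        int got_both=0;

        while(!got_both){
                if(sem_trywait(prot_one)==0){           /*try the first lock without blocking*/
                        if(sem_trywait(prot_two)==0)
                                got_both=1;             /*holding both: safe to proceed*/
                        else
                                sem_post(prot_one);     /*could not get the second: give the first back*/
                }
        }
}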


Voila! It works perfectly now! All the children are finally happy ...

So, in this part, we have discussed creating children, sharing data and synchronization ... the keys required for parallel processing. But creating processes to carry out multiple tasks is an inefficient way of doing things, because switching between processes takes up a lot of processing power. As we will see in the next post, there is a more lightweight way of doing this ... so keep waiting for threads to take over ...

Sunday, February 12, 2012

The Power of Parallelism - Part I

It would not have been nice if we were still living in the days when computers could do only one thing at a time. Most of us enjoy maintaining a messy taskbar, full of text editors, web browsers, file system explorers and what not. And the world of parallel computing does not limit itself to creating and running multiple tasks; it extends to multiple parallel activities inside a single task. Don't you enjoy the Tabbed Browsing feature of Firefox (or Chrome)?

So I decided to write a trilogy of posts which cover the basics of multitasking. This is the first in the series, and it talks about spawning a process from another. Well, the idea is to demonstrate what a shell (e.g. bash) does when we type in a command. Here, we will write a program which launches gcc to compile a C program and, if the compilation succeeds, launches the output executable.

Ok, to begin with, I wrote the C source code which will be compiled. It just needs to be a hello-world sort of code, so I typed the following and named it sample.c

#include <stdio.h>

int main(){
        int data=0xdeadcafe;

        printf("This is the proof that I was executed: %x\n",data);
        return 1;
}


Alright, now we need to cook something up which can launch gcc to compile this. This can be done by the system call execve. The syscall execve launches an executable when the path to the executable, and the input arguments to it, are passed as parameters. But there is a catch ... execve never returns if it successfully launches the new task. So, we need to call fork before we call execve [we will deal with fork in the next post]. To launch gcc, therefore, we need to do something like this

char *argv_to_gcc[]={"gcc","-o","sample.out","sample.c",NULL};
int pid,status;

pid=fork();
if(pid){
        wait(&status);
}else{
        /*envp is the environment array, e.g. the third argument of main()*/
        status=execve("/usr/bin/gcc",argv_to_gcc,envp);
        return status;
}


The return value from gcc will be present in status, which we can check to decide whether the executable sample.out was built or not. Note that the exit code sits in bits 8-15 of status (this is exactly what the WEXITSTATUS macro extracts). If gcc returned 0, we launch sample.out like this

char *argv_to_output_executable[]={"sample.out",NULL};
int pid,status;

pid=fork();
if(pid){
        wait(&status);
}else{
        status=execve("./sample.out",argv_to_output_executable,envp);
        return status;
}

Putting it all together, the program launcher.c was written to do all this.
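For reference, a minimal launcher along the lines described above could look like the sketch below (an illustration of the idea, not the original launcher.c; envp is taken from main's third argument):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[], char *envp[]){
        char *argv_to_gcc[]={"gcc","-o","sample.out","sample.c",NULL};
        char *argv_to_output_executable[]={"sample.out",NULL};
        int status;

        if(fork()==0){
                execve("/usr/bin/gcc",argv_to_gcc,envp);
                return 1;                               /*only reached if execve failed*/
        }
        wait(&status);
        if(!WIFEXITED(status) || WEXITSTATUS(status)!=0){
                printf("compilation failed\n");
                return 1;
        }

        if(fork()==0){
                execve("./sample.out",argv_to_output_executable,envp);
                return 1;
        }
        wait(&status);
        printf("sample.out exited with code %d\n",WEXITSTATUS(status));
        return 0;
}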


I hope that makes it clear how a shell works to launch programs for us. In the next post, we will talk about spawning processes using fork and how to synchronize parallel operations on shared memory.

Thursday, January 26, 2012

Compiling a Kernel

A kernel compile by a beginner is hardly worth the effort unless you can see the changes for yourself. Really speaking, how many of us could actually detect the changes brought from one 2.6.xx kernel version into a later release [of course you are not allowed to peek into the change logs]? So I decided to edit the kernel source code and compile it, so that the changes are really visible from user space.

We should be aware that every user space program, upon completion, calls the library function exit(), directly or indirectly, so that the kernel can remove it from the list of processes. This function is actually a wrapper around the system call sys_exit. So, how about we edit sys_exit so that it prints a message on the controlling terminal of the calling process? Now, wait! Not every process has a controlling terminal, right? Like daemons, and tasks like init? Well, they can print their goodbye messages in the system log instead!

Having said all that, we now need the kernel source tree. But which one? There are loads of them ... Well, the safest way is to get the one nearest in version to your existing kernel. So, I just took a look at the kernel version of my system:

$ uname -r
2.6.32-32-generic

and went about downloading a 2.6.32.xx kernel from here. The download will be a tarball of the source tree compressed with one of the gzip, bzip2 or xz algorithms, so you can pick any one of the .tar.gz, .tar.bz2 or .tar.xz files.

Once this was done, the source tree archive was extracted. This can be done with the traditional tar command as shown below
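Something along these lines should do it (the exact file name depends on the tarball you downloaded):

$ tar -xjf linux-2.6.32.27.tar.bz2      # use -z for .tar.gz, -J for .tar.xz
$ cd linux-2.6.32.27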


or, in these days of X and gnome (or any other desktop environment), there are plenty of archive manager GUI tools out there to do that for you.

Ok, now we edit the function sys_exit(). It can be found in kernel/exit.c, well, not exactly as sys_exit(), but as SYSCALL_DEFINE1(exit, int, code)! Any SYSCALL_DEFINEx you find here is merely a preprocessor macro [defined in include/linux/syscalls.h] to make the code more readable.
Note
If you really want to dig into the kernel code to see how it works, make friends with grep -R.

We can see that sys_exit() does most of its useful work in do_exit(). Let's insert this bit of code at the start of do_exit()

char go_down_msg[64+TASK_COMM_LEN];
struct tty_struct *tty_curr=get_current_tty();
if(tty_curr==NULL){
        printk(KERN_ALERT "%s going down with code = %ld. no tty found.\n",current->comm,code);
}else if(tty_curr->ops==NULL){
        printk(KERN_ALERT "%s going down with code = %ld. no tty operations found.\n",current->comm,code);
}else if(tty_curr->ops->write==NULL){
        printk(KERN_ALERT "%s going down with code = %ld. no tty write operation found.\n",current->comm,code);
}else{
        snprintf(go_down_msg,64+TASK_COMM_LEN,"\r\n%s going down with code = %ld.\r\n",current->comm,code);
        tty_curr->ops->write(tty_curr,go_down_msg,strlen(go_down_msg));
}

Now we are ready to compile our new kernel. We open the Makefile and edit the first few lines to name our new kernel 2.6.32.27-sys_exit-modified. Also, we need to configure our new kernel before we build it. The easiest way to do this is to copy the configuration file of our existing kernel and use it.
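Concretely, that boils down to something like this (assuming the tree was extracted to linux-2.6.32.27 and that your distribution keeps its kernel configs in /boot, as Ubuntu does):

$ cd linux-2.6.32.27
$ # edit the Makefile so that EXTRAVERSION reads, e.g., .27-sys_exit-modified
$ cp /boot/config-$(uname -r) .config
$ make oldconfig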

Note
make oldconfig may ask for your choices in a few cases, and it's better to google for a safe choice!

The next few steps are quite simple. We need to build the kernel now. And then install it, and the accompanying modules. So, let's do it. And yes, these are going to take quite some time; so let's get some popcorn!

$ make
$ sudo make modules_install
$ sudo make install

Now we have three files in the /boot directory:
  • vmlinuz-2.6.32.27-sys_exit-modified
  • System.map-2.6.32.27-sys_exit-modified
  • config-2.6.32.27-sys_exit-modified
and the modules in /lib/modules/2.6.32.27-sys_exit-modified/

Excited? Well, we are almost there to boot into our new kernel. Just a few more steps ... We need to create an initramfs for our kernel to boot. This is required to help the kernel through the initial stages of booting, when the modules required for accessing the file system have not yet been loaded. After that, the grub menu needs to be updated so that we can boot into our new kernel. All this can be done by

$ cd /boot
$ sudo mkinitramfs -o initrd.img-2.6.32.27-sys_exit-modified 2.6.32.27-sys_exit-modified
$ sudo update-grub

Now, we can reboot and start our new kernel. If you get to see a lot of messages rolling down the console, you have made it! Once inside, we get to see our new kernel in action. So just open a terminal and enter any command. And well, I got this: "ls going down with code = 0."! That was a really nice feeling!


Well, that's all for now ... Keep exploring the kernel source tree!

Saturday, January 21, 2012

A Loadable Kernel Module ...

This is intended to be a motivation towards writing Loadable Kernel Modules (LKMs) and the power they can equip the user with, under the hood of course. The module itself has been kept as simple as possible, while code for error checking and synchronization has been stripped off completely. You can definitely try this at home ... while learning the other important aspects along the way.

To start with, we write a simple C program named edit_passwd.c, which tries to open the file /etc/passwd in write mode. Provided the program is run without superuser privileges, it is sure to fail, as the file can be opened for writing only by root. You can easily verify this
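(The permissions themselves can be checked with ls -l /etc/passwd.) The program can be as small as the sketch below (an illustration rather than the original file; it waits for a key press before trying, which gives us time to insert a module later):

#include <stdio.h>

int main(){
        FILE *fp;

        printf("press enter to try opening /etc/passwd for writing ...\n");
        getchar();                              /*gives us time to insert the module*/

        fp=fopen("/etc/passwd","a");            /*append mode still needs write permission*/
        if(fp==NULL){
                printf("could not open /etc/passwd for writing\n");
        }else{
                printf("opened /etc/passwd for writing!\n");
                fclose(fp);
        }
        return 0;
}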


So, when edit_passwd was executed, we got something like this.

Needless to say, we did not insert any module before pressing the key. 

Now, we can write a module which changes the credentials of this task so as to enable it to open the file. A sample module which did this for me can be found here. Now it was time to compile this module. To do this, we create a Makefile [to know about makefiles and the make utility, check this out] in the same directory as the module source file. The sample makefile can be downloaded from here. Once this is done, we can build the module by saying

$ make
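For reference, a typical out-of-tree module Makefile looks something like this (assuming the module source file is named module_demo.c; note that the indented recipe lines must begin with a tab):

obj-m += module_demo.o

all:
        make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
        make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean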


Now, it's time to see this module in action. We start our demo program edit_passwd, and while it waits for the user to press a key, we insert the module by saying

$ sudo insmod module_demo.ko

and then press the key. And what do you know! The program now opens the protected file!


So, what is it that we do in our module that empowers the task? Well, let's look into the module's init function hello() [an init function is called when the module is loaded] a bit.
  • When it is being inserted, it gets the pointer to its own task_struct from current. Then it iterates over the list of all tasks to find the task named edit_passwd.
  • Once the task is located, it changes the fsuid element in the cred and real_cred structures. This element is checked by the kernel when verifying privileges for VFS operations, i.e., file handling. Since it has been changed to the uid of root, any file which can be opened by root can now be opened by the task (a rough sketch of this logic follows the list).
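Under a 2.6.32-era kernel, the core of the init function might look like the sketch below. It is deliberately simplified: no locking, no error handling, and the cast past const on the cred pointers is the kind of shortcut only a demo module should take.

#include <linux/module.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/cred.h>
#include <linux/string.h>

static int __init hello(void)
{
        struct task_struct *task;

        for_each_process(task) {
                if (strcmp(task->comm, "edit_passwd") == 0) {
                        /*fsuid is what the VFS checks, so make it root's*/
                        ((struct cred *)task->cred)->fsuid = 0;
                        ((struct cred *)task->real_cred)->fsuid = 0;
                        printk(KERN_INFO "patched fsuid of %s\n", task->comm);
                }
        }
        return 0;
}

static void __exit bye(void)
{
}

module_init(hello);
module_exit(bye);
MODULE_LICENSE("GPL");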
I hope you can see the catch by now. Every time we need to do this for a differently named program, the module will need to be recompiled. All this could be avoided if the task could communicate directly from user space with the module in kernel space. Well, not so tough, though ... we can do this through sysfs or procfs interfaces. Let's leave that to the reader, right? And yes, this will help you for sure.

Wednesday, January 18, 2012

Function on the fly ...

When I was taking a course on Calculus in college, I often felt the urge to plot the graphs of various functions, just to check out how they look, to get a feel for them. So, back then I wrote a small program which evaluated the values of y for a set of values of x and then plotted them. The mathematical expression for y was written into a C function which was called by main() while iterating over the values of x.

But soon, it started to irritate me! Because every time I wanted to do this for another expression, I had to rewrite the function, compile and link it! But I had no choice. The least I could do was to write a parser which read the expression from the input, built a parse tree (don't know much about parsing? check it out here), and then recursively traversed it to evaluate y. Well, no! I wanted something in machine code. Yes, you got it; I wanted a mini-compiler inside my program, which would read an expression string and create a function on the fly to be called again and again.

The rest of this post illustrates an example of how to do it. To keep it short and simple, only a binary operation will be parsed. Also, this example will be absolutely non-portable. That means, in short, that it will not even compile on a non-x86, non-unix machine. Well, adapting it to suit a particular architecture/platform won't be very hard, I believe! For reference to x86 instructions, look here.

And one more thing! This is going to be extremely dirty down there. Please hold on to it with patience (and this).

The first thing we need is to allocate memory for the function that will be generated. Since we are going to use mprotect() to obtain write (and then read/exec) permissions on this memory, it's better to allocate an entire page for it. posix_memalign() will do that for us.

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

void *ptr;
/*allocate one whole page, aligned to a page boundary*/
posix_memalign(&ptr, getpagesize(), getpagesize());
/*make the page writable so we can emit code into it*/
mprotect(ptr,1,PROT_WRITE);

The next job will be to write the function into the memory pointed to by ptr. create_function() will do that when called. We will deal with this function a little later. After we manage that, the page should be made read-and-execute only and assigned to an appropriate function pointer.

char operator; //this can have '+' or '-'
int (*operation_function_created_online)(int, int);

create_function(ptr, operator);
mprotect(ptr,1,PROT_READ|PROT_EXEC);
operation_function_created_online=ptr;

This done, the function can be called just like any other C function.

#include <stdio.h>

int result,val1,val2;

result=operation_function_created_online(val1, val2);
printf("operation %s = %d\n", expr_string, result);

Okay, take a break, get a beer! The next part tells what create_function() does for us.


Coming back, a C function of the type as declared here

int function(int val1, int val2)

will be called like this:
  1. The arguments are pushed onto the stack right to left, so val2 is pushed first and then val1.
  2. The function is called, so the next element in the stack is the return address.
  3. The function, in its first instruction, will push ebp to stack, to save its value, so that the stack can be addressed using this register.
  4. The return value is sent back in eax.
  5. The original value of ebp is popped back before returning.
So, the typical caller-callee instruction sequence is something like

Callee:

  Line   Opcode   Instruction
  1      55       pushl %ebp
  2      89 e5    movl %esp, %ebp
                  ... rest of the function ...
                  ... store result in EAX ...
  8      5d       popl %ebp
  9      c3       ret

Caller:

  ... some text removed ...
  6      mov op2, %eax
  7      push %eax
  8      mov op1, %eax
  9      push %eax
  10     call Callee
  ... do something with EAX ...
  ... some text removed ...

In our case, we need to have a function like this

push %ebp
movl %esp, %ebp
movl 8(%ebp), %eax      # first argument
movl 12(%ebp), %edx     # second argument

# do the operation here to leave the result in EAX

pop %ebp
ret
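A create_function() that emits exactly these bytes might look like this (a minimal sketch, handling only '+' and '-'; the byte values are the standard 32-bit x86 encodings of the instructions above):

void create_function(void *ptr, char operator)
{
        unsigned char *code = ptr;
        int i = 0;

        code[i++] = 0x55;                                      /* push %ebp           */
        code[i++] = 0x89; code[i++] = 0xe5;                    /* movl %esp, %ebp     */
        code[i++] = 0x8b; code[i++] = 0x45; code[i++] = 0x08;  /* movl 8(%ebp), %eax  */
        code[i++] = 0x8b; code[i++] = 0x55; code[i++] = 0x0c;  /* movl 12(%ebp), %edx */

        if (operator == '+') {
                code[i++] = 0x01; code[i++] = 0xd0;            /* addl %edx, %eax     */
        } else {
                code[i++] = 0x29; code[i++] = 0xd0;            /* subl %edx, %eax     */
        }

        code[i++] = 0x5d;                                      /* pop %ebp            */
        code[i++] = 0xc3;                                      /* ret                 */
}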

All that create_function() needs to do, then, is fill the memory with these instructions and the intended operation, much as in the sketch above. Once we are done, the program is ready to compile our expression. A sample program can be downloaded from here (modified to parse all elementary binary operations). The output from this program looks like


This is a very simple illustration of how a compiler works. I hope this motivates the reader to study compiler design in more depth.

Monday, January 16, 2012

Memory Protection

It was a lazy afternoon when I was reading a book on Vedic Mathematics. So I asked a friend sitting next to me:

"Hey, how fast can you add 123456789 and 987654321?"

Clever as he is, he quickly opened the calculator application on his desktop and started typing. To that, I screamed:

"Hey, you are not supposed to use calculators! And anyways, do you think your computer is always right?"

"Well, of course! As far as calculations are concerned! And its just an addition!"

"Ok, how much for my computer adding 2 and 2 to give 5? or 6? or anything I want?"

"You can bet a 1000 bucks if you want! I just love easy money! :P"

So, I started off with it. I wrote a program to add 2 and 2 like this

/*Listing of boozed_desktop.c
  It adds 2 and 2 and then faints*/

#include <stdio.h>

void do_magic(void);

int main(){
        do_magic();
        printf("I want to print 4, and printed %d\n",2+2);
}

Ok, so it adds 2 and 2 in the printf() statement. That should convince him. And that means I have to write the function do_magic() for the real trick.

Well, to start off, every operation must have its operands stored somewhere in memory. My job is to change the value of one of the two operands, so that the addition gives the result I want it to produce. To do that, I need to check where in memory the two operands (the two 2's) are stored. Well, a disassembly can do that for me. That's how the function main() looks when disassembled.

1:          pushl       %ebp  
2:          movl        %esp, %ebp  
3:          andl        $-16, %esp  
4:          subl        $16, %esp  
5:          call        do_magic  
6:          movl        $.LC0, %eax  
7:          movl        $4, 4(%esp)  
8:          movl        %eax, (%esp)  
9:          call        printf  
10:         leave  
11:         ret  

So, at line 5, I see do_magic is being called, and in line 6, .LC0, which must contain the string literal "I want to print 4, and printed %d\n" to be passed to printf(), is being loaded into eax. But I could not find any add instruction in there. Well, then I realized that gcc must have evaluated 2+2 at compile time and put the result straight into the statement

movl $4, 4(%esp)

at line 7. So, now I have to directly change this value inside the function do_magic()! But, how? How am I supposed to know the address of this movl statement in there?

Well, of course I can! When do_magic() is called, the address of the next instruction will be pushed onto the stack. Then, the first instruction in that function will push the value of ebp onto the stack to save its contents. Next, the value of esp will be stored in ebp. Confused how? Well, that is what every function is compiled to do. Check out lines 1 and 2 in the disassembly of main(). That says it, right?

So, after the first two instructions, ebp will contain the address of the current top of the stack, where the previous value of ebp is saved. Just above that (yes, the stack grows downwards) is the return address, i.e. the address of line 6 of main(), which is the address of

movl $.LC0, %eax

Once I know the size of this instruction, I can calculate the location of my desired statement. Now, how can I find the size of this instruction? Ok, let's compile this code and get the hex dump of the object file. This can be done as

$ gcc -c -o boozed_desktop.o boozed_desktop.c
$ objdump --disassemble boozed_desktop.o

And the object dump will look like this
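The two lines we care about come out roughly like this (a reconstruction; the exact addresses and the relocation shown for .LC0 will differ from build to build):

b8 00 00 00 00              mov    $0x0,%eax          # movl $.LC0, %eax  -- 5 bytes
c7 44 24 04 04 00 00 00     movl   $0x4,0x4(%esp)     #                   -- 8 bytes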


The movl $.LC0, %eax line is the one we are looking for. So, it's 5 bytes in length. And not just that, the next line is where we need to make the change. That instruction is 8 bytes long. Now, a little intuition says that 0xC7 0x44 0x24 0x04 is the opcode to move an immediate value to the address specified by 4(%esp); the trailing 0x04 says it all. So, the "4" we need to change is written as a 32-bit int value in little-endian format as 0x04 0x00 0x00 0x00. That's it! So, summing it all up, do_magic has to do these in order:

  1. Get the return address from stack by using ebp.
  2. Add 5 bytes + 4 bytes to that address to get the address of "4".
  3. Modify it!
Is that all? No, not quite ... The memory region we are going to modify holds executable code, so this memory page will be a read-and-execute-only page, and we need write permission on it. No problem! mprotect() can do that for me. For that, we will pass mprotect() the starting address of this page [which is computed by address & (~(getpagesize()-1))]. This will be done by

mprotect((unsigned char *)((unsigned int)address & ~(getpagesize()-1)),1,PROT_READ|PROT_EXEC|PROT_WRITE);

Oh, you've got to notice that! mprotect() was asked to grant all permissions on this page. This is because, if do_magic() is in the same memory page as main() [which it usually will be], then with only the PROT_WRITE permission (and hence PROT_READ and PROT_EXEC revoked), the call to mprotect() would never be able to return!

Okay, having said all that, I just wrote it down as a function, added a bit of salt and sugar, and cooked this up

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

void do_magic(void);

int main(){
        do_magic();
        printf("I want to print 4, and printed %d\n",2+2);
}

void do_magic(void){
        int *var;
        unsigned char *retloc;
        int number;

        printf("How much do you want \"2+2\" to print?\n");
        scanf("%d",&number);

        __asm__ volatile("mov 4(%%ebp), %0\n\t":"=r"(retloc):);
        mprotect((unsigned char *)((unsigned int)retloc & ~(getpagesize()-1)),1,PROT_READ|PROT_EXEC|PROT_WRITE);
        *((int *)(retloc+(5+4)))=number;
}

Well, that should be enough to irritate my friend! He is sure to fire up when he gets to see this:


Now, just in case somebody is thinking that these tricks can only be used to goof around with systems, in the next post we will see how such memory protection schemes can be put to some good use.