A multi-thread and multi-process safe shared library uses a shared semaphore/mutex/spinlock to enter a critical section. This shared library is linked by many applications (processes) in the system.
One application crashes after entering the critical section of this shared library.
Eventually all the other applications waiting on this shared lock block indefinitely, so the whole system effectively hangs!
How can this situation be avoided, or recovered from, in Linux?
I faced a similar problem in the past with POSIX semaphores. I shared a named semaphore across processes, one process crashed with a segmentation fault while holding it, and the system did not reboot. I solved it using file locks; the details are below.
Linux has two types of semaphores:
- System V semaphore
- POSIX semaphore
System V semaphore:
SysV semaphores have a built-in way to recover from this problem: specify SEM_UNDO in the sem_flg field of the struct sembuf operations passed to semop(). This flag tells the kernel to undo the changes the process made to the semaphore when the process terminates, in both abnormal and normal termination scenarios. Check out the semop(2) man page.
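A minimal sketch of the SEM_UNDO behaviour, assuming Linux with glibc. It creates a private one-element SysV semaphore initialised to 1, lets a forked child take it with SEM_UNDO and then kill itself with SIGKILL to simulate the crash, and reads the value back in the parent. The helper names (sysv_lock, sysv_unlock, sysv_value_after_crash) are hypothetical; union semun has to be defined by the program itself, as semctl(2) requires on Linux.

```c
#include <signal.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* On Linux the caller must define union semun itself (see semctl(2)). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* P(): decrement; SEM_UNDO makes the kernel record an undo adjustment. */
static int sysv_lock(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
    return semop(semid, &op, 1);
}

/* V(): increment; SEM_UNDO here cancels the adjustment recorded by P(). */
static int sysv_unlock(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };
    return semop(semid, &op, 1);
}

/* Hypothetical demo: a child locks and is killed; returns the value seen afterwards. */
static int sysv_value_after_crash(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1)
        return -1;

    union semun arg = { .val = 1 };          /* binary semaphore, initially free */
    if (semctl(semid, 0, SETVAL, arg) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        sysv_lock(semid);   /* enter critical section... */
        raise(SIGKILL);     /* ...and die without calling sysv_unlock() */
        _exit(1);           /* not reached */
    }
    waitpid(pid, NULL, 0);

    /* The kernel applied the child's undo adjustment when it exited. */
    int val = semctl(semid, 0, GETVAL);
    if (val == 1) {         /* lock is free again: take and release it */
        sysv_lock(semid);
        sysv_unlock(semid);
    }
    semctl(semid, 0, IPC_RMID);
    return val;
}
```

sysv_value_after_crash() should return 1. Note that SEM_UNDO is used on the release too; without it, the adjustment recorded by the acquire would stay pending and be applied again when the process later exits, pushing a binary semaphore to 2.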
POSIX semaphores don't have an option to undo the changes when the process terminates. The obvious alternative is to find a resource that the kernel frees/undoes during process termination. One solution I have used is to replace the POSIX calls sem_wait() and sem_post() with lockf(), which performs record locking on a file descriptor. Since all open descriptors (and the locks held through them) are released when a process terminates, normally or abnormally, I opted for this solution.
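A minimal sketch of the lockf() replacement, assuming Linux/glibc; the helper names (file_lock_acquire, file_lock_release, lock_survives_crash) and any lock-file path are hypothetical. The lock is held for as long as the descriptor stays open, and a forked child that takes the lock and is then killed with SIGKILL shows that the kernel drops the lock by itself.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) the lock file and block until we own it. */
static int file_lock_acquire(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);  /* lockf() needs a writable fd */
    if (fd == -1)
        return -1;
    /* len 0 = lock from the current offset (0 here) to end of file, however it grows */
    if (lockf(fd, F_LOCK, 0) == -1) {
        close(fd);
        return -1;
    }
    return fd;  /* keep fd open for as long as the lock must be held */
}

/* Hypothetical helper: unlock and close (unlocking an unheld region is harmless). */
static void file_lock_release(int fd)
{
    lockf(fd, F_ULOCK, 0);
    close(fd);
}

/* Returns 1 if the lock is free again after its holder was killed. */
static int lock_survives_crash(const char *path)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        file_lock_acquire(path);   /* enter critical section... */
        raise(SIGKILL);            /* ...and die without releasing */
        _exit(1);                  /* not reached */
    }
    waitpid(pid, NULL, 0);

    /* The kernel released the child's lock when it closed the child's descriptors. */
    int fd = open(path, O_RDWR);
    if (fd == -1)
        return -1;
    int ok = lockf(fd, F_TLOCK, 0) == 0;   /* non-blocking attempt */
    file_lock_release(fd);
    return ok;
}
```

With a lock file such as /tmp/mylib.lock (hypothetical), lock_survives_crash() should return 1. Two caveats: these record locks belong to the process, so they do not exclude threads of the same process from each other (a multi-thread-safe library would still need a pthread mutex around them), and closing any descriptor for the lock file releases all of the process's locks on it.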
Note: this is just one solution for the semaphore case, the one I opted for.
How come only SysV semaphores have such liberty?
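As stated above, semop() is a system call, which means the big boy, i.e. the kernel, handles the lock internally, not a user-space library. The kernel keeps an undo record for every SEM_UNDO operation a process performs, and when the process terminates, whether it calls exit() or is killed by a signal (yes, a fatal signal is handled in the kernel's signal-handling path, which ends in the same exit code), the kernel's exit path applies those undo records while cleaning up all of the process's other resources. A POSIX semaphore, by contrast, is (in glibc) a counter in shared memory updated from user space, so the kernel has no record of which process decremented it.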
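For contrast, a sketch showing that a POSIX semaphore gets no such cleanup, assuming Linux/glibc (link with -pthread on older glibc). As a simplification it uses an unnamed process-shared semaphore in an anonymous shared mapping (sem_init() with pshared = 1) rather than a named sem_open() semaphore; the helper name posix_sem_stuck_after_crash is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <semaphore.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the semaphore is still "held" after its holder was killed. */
static int posix_sem_stuck_after_crash(void)
{
    /* Unnamed semaphore in shared memory; pshared = 1 allows use across fork(). */
    sem_t *sem = mmap(NULL, sizeof *sem, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (sem == MAP_FAILED || sem_init(sem, 1, 1) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        sem_wait(sem);      /* P(): enter critical section... */
        raise(SIGKILL);     /* ...and die before sem_post() */
        _exit(1);           /* not reached */
    }
    waitpid(pid, NULL, 0);

    /* Nothing undid the child's decrement: a blocking sem_wait() would hang here. */
    int stuck = (sem_trywait(sem) == -1 && errno == EAGAIN);
    sem_destroy(sem);
    munmap(sem, sizeof *sem);
    return stuck;
}
```

posix_sem_stuck_after_crash() should return 1: the decrement made by the killed child is never reverted, so the next process that calls a blocking sem_wait() waits forever, which is the original problem.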
Now consider the two mutex cases in POSIX:
1. Mutex within a process
2. Mutex shared across processes
Mutex within a process: if the process is killed and the lock is not shared with any other process, there is nothing to worry about; let it be.
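Mutex shared across processes:
This is done by calling pthread_mutexattr_setpshared() with PTHREAD_PROCESS_SHARED (the mutex itself must also live in memory shared by the processes). To cope with the owner's termination, the mutex can additionally be made robust with pthread_mutexattr_setrobust() and the flag PTHREAD_MUTEX_ROBUST (the opposite, and the default, is PTHREAD_MUTEX_STALLED). A robust mutex is not silently released when its owner dies: the next pthread_mutex_lock() returns EOWNERDEAD, and the new owner is expected to repair the protected data and call pthread_mutex_consistent() before unlocking.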
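A minimal sketch of a robust, process-shared mutex, assuming Linux/glibc (link with -pthread on older glibc). The mutex lives in an anonymous shared mapping inherited across fork(); a child locks it and is killed, and the parent's next lock returns EOWNERDEAD instead of blocking. The helper names are hypothetical, and the data-repair step is left as a comment.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: allocate a robust, process-shared mutex in shared memory. */
static pthread_mutex_t *robust_mutex_create(void)
{
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return m;
}

/* Lock, marking the mutex consistent if the previous owner died holding it.
   Returns EOWNERDEAD if recovery happened, 0 on a normal lock. */
static int robust_lock(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        /* The data guarded by m may be half-updated: repair it here first. */
        pthread_mutex_consistent(m);
    }
    return rc;
}

/* Hypothetical demo: a child takes the mutex and is killed; the parent then locks it. */
static int robust_lock_after_crash(void)
{
    pthread_mutex_t *m = robust_mutex_create();
    if (m == NULL)
        return -1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        pthread_mutex_lock(m);   /* enter critical section... */
        raise(SIGKILL);          /* ...and die holding the mutex */
        _exit(1);                /* not reached */
    }
    waitpid(pid, NULL, 0);

    int rc = robust_lock(m);     /* returns instead of hanging */
    pthread_mutex_unlock(m);
    pthread_mutex_destroy(m);
    munmap(m, sizeof *m);
    return rc;
}
```

robust_lock_after_crash() should return EOWNERDEAD. If the new owner unlocks without calling pthread_mutex_consistent(), the mutex becomes permanently unusable and later lock attempts fail with ENOTRECOVERABLE.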
Sorry, I have never used this shared mechanism myself, so the above info is just theoretical. See the man pages (pthread_mutexattr_setpshared(3), pthread_mutexattr_setrobust(3)) for better usage info.