What is the best way to ensure synchronization and memory consistency in the following scenario:
- Two processes are running on a multi-core machine
- The machine also has an RDMA-capable network (InfiniBand)
- The two processes have read/write access to a window of shared memory. This window is also registered with the network interface, so the data in the shared-memory region can be changed either by the processes on the node or by the network.
Suppose P1 is running on core0 and P2 on core1.
P2 initializes its shared-memory window with
memset(.., 1, size); P1 then writes 0's into this shared-memory window through the network interface (RDMA, loop-back). I get the required network completion event, so as far as the network is concerned the 0's have been written into the shared memory. When P2 then reads this memory region, it should read "0" instead of "1". But I am seeing a race condition here, and this is not always the case.
I have tried the following cases:
- pthread-mutex:
1. A pthread_mutex_t lives in this shared-memory window, initialized with the attribute PTHREAD_PROCESS_SHARED.
2. P1 locks the mutex and performs the network operation. When P1 gets the network completion event, it releases the lock.
3. P2 then locks the mutex and reads the data from the shared-memory window.
- memory barriers:
I have tried inserting the following memory barriers just after P2 acquires the lock and just before it reads the data. It still doesn't seem to help.
__asm__ __volatile__ ( "lfence" ::: "memory" );  /* load fence */
__asm__ __volatile__ ( "sfence" ::: "memory" );  /* store fence */
__asm__ __volatile__ ( "mfence" ::: "memory" );  /* full fence */
__asm__ __volatile__ ( "sfence" ::: "memory" );
__sync_synchronize();                            /* GCC full barrier */
Any suggestions on how to handle such a situation to ensure correctness? I am using the GNU compilers, and the compute node is an Intel Westmere machine.
Thanks,
--K