The diffs aren’t pretty. A lot of the code went through a “beautifier” (a macro I run out of habit).
I didn’t hack any further than the freelist code, and just the freelist benchmark on the test side. Here’s the important stuff:
ReadMe.txt
- LFDS doesn’t actually support IA64. _InterlockedCompareExchange128 isn’t available for IA64 (I don’t think it has a true dcas instruction). There are references to IA64 on your wiki too.
Inc\liblfds.h
- I wanted to consolidate all the “preprocessor know-how” into a single location, making it easier to recognize and implement the various ports. I came up with the following standard defines (a detection sketch follows the list):
Compiler (currently supported):
LIBLFDS_MSVC
LIBLFDS_GCC_UNIX - GCC on *nix (not Cygwin, not MinGW)
Environment (available headers, functions):
LIBLFDS_WINDOWS - PSDK or DDK environment
LIBLFDS_POSIX - Posix environment
Pointer Width:
LIBLFDS_32BIT
LIBLFDS_64BIT
Architecture (optional, add as needed):
LIBLFDS_X86
LIBLFDS_ARM
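For illustration, roughly how liblfds.h could map the compiler-provided macros onto this scheme (a hedged sketch; the patch’s actual detection logic may differ):

    /* sketch only - real detection in the patch may differ */
    #if defined(_MSC_VER)
      #define LIBLFDS_MSVC
      #define LIBLFDS_WINDOWS
      #if defined(_WIN64)
        #define LIBLFDS_64BIT
      #else
        #define LIBLFDS_32BIT
      #endif
    #elif defined(__GNUC__) && !defined(__CYGWIN__) && !defined(__MINGW32__)
      #define LIBLFDS_GCC_UNIX
      #define LIBLFDS_POSIX
      #if defined(__LP64__)
        #define LIBLFDS_64BIT
      #else
        #define LIBLFDS_32BIT
      #endif
    #endif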
- I added inline support for abstraction_[dcas|cas|increment] under gcc. Part of that included marking them extern here. (More on inlining under gcc below.)
- The FREELIST_COUNT_DCAS stuff was something I did for the freelist benchmark; it lets each thread get a count of its dcas retries.
src\liblfds_internal.h
- Added include guards to the headers that didn’t have them.
- I moved some things from liblfds.h to here that seemed like they didn’t need to be “publicly exposed” – ALIGN, INLINE, some headers.
- INLINE – This is used by abstraction_[dcas|cas|increment]. The trick here is to get a publicly exposed symbol that external modules can link to, while having the compiler inline any internal calls to those functions (within the lib/dll/so). Under MSVC this is easy via “extern __forceinline”. Under gcc, the easiest thing to do is to compile the inline functions as non-inline in their own translation unit, to be linked into the lib/dll/so; when the rest of the lib calls those functions, they’re defined inline. To achieve this, abstraction_cas.c, abstraction_dcas.c, and abstraction_increment.c are all #include’d at the end of liblfds_internal.h. There is a new source file, abstraction_inlines.c, which just includes liblfds_internal.h with INLINE disabled; it is only compiled/linked under gcc. (A sketch of the pattern follows the links below.)
Here’s the reference material I used for this:
http://gcc.gnu.org/gcc-4.3/porting_to.html
http://www.greenend.org.uk/rjk/2003/03/inline.html
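To make the pattern concrete, here’s a minimal sketch assuming C99 inline semantics; NO_INLINE is my name for the switch, the patch may spell it differently:

    /* liblfds_internal.h (sketch) */
    #if defined(LIBLFDS_MSVC)
      /* body is inlined internally, and 'extern' also emits a linkable symbol */
      #define INLINE extern __forceinline
    #elif defined(LIBLFDS_GCC_UNIX) && !defined(NO_INLINE)
      /* C99 'inline': an inline definition for this TU, no external symbol */
      #define INLINE inline
    #else
      /* plain definitions - this is what abstraction_inlines.c compiles */
      #define INLINE
    #endif

    /* bodies live at the end of the header so every internal caller
       sees inline definitions */
    #include "abstraction_cas.c"
    #include "abstraction_dcas.c"
    #include "abstraction_increment.c"

    /* abstraction_inlines.c (sketch, gcc only): re-compile the same bodies
       as ordinary functions to provide the out-of-line, linkable copies */
    #define NO_INLINE
    #include "liblfds_internal.h"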
- dcas_ptr – What I really wanted was a type I could throw anywhere and know it has correct alignment, without having to remember to use the ALIGN macro. This is what I came up with. The dcas_ptr_x() macros are an attempt to make typical usage a little more readable. Another advantage is that you can omit any packing on structures; the compiler just does the right thing, since it knows the alignment requirements of dcas_ptr. (A sketch follows.)
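For illustration, a hedged sketch of the kind of type I mean; the actual field names, alignment values, and dcas_ptr_x() macros in the patch may differ:

    /* the alignment rides along with the type, so no per-variable ALIGN
       macro is needed: 16 bytes for cmpxchg16b on 64-bit, 8 bytes for
       cmpxchg8b on 32-bit */
    #include <stdint.h>

    #if defined(LIBLFDS_64BIT)
      #define DCAS_ALIGNMENT 16
    #else
      #define DCAS_ALIGNMENT 8
    #endif

    #if defined(LIBLFDS_MSVC)
      typedef struct __declspec(align(DCAS_ALIGNMENT)) dcas_ptr {
        void      *ptr;      /* pointer half */
        uintptr_t  counter;  /* ABA counter half */
      } dcas_ptr;
    #else
      typedef struct dcas_ptr {
        void      *ptr;
        uintptr_t  counter;
      } __attribute__((aligned(DCAS_ALIGNMENT))) dcas_ptr;
    #endif

    /* illustrative accessors in the spirit of the dcas_ptr_x() macros */
    #define dcas_ptr_pointer(d)  ((d).ptr)
    #define dcas_ptr_count(d)    ((d).counter)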
src\abstraction\abstraction_[dcas|cas|increment].c
- Uses LIBLFDS_* directives.
- Added an #error if none of the ports within the file is selected for compilation (sketched below). This seemed easier than trying to make liblfds.h a catch-all for platforms that may not have everything ported to them.
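Something along these lines at the bottom of each file (a sketch; the exact message in the patch differs):

    /* abstraction_cas.c (sketch): fail the build loudly if no port matched */
    #if !defined(LIBLFDS_MSVC) && !defined(LIBLFDS_GCC_UNIX)
      #error abstraction_cas: no port selected for this compiler/platform
    #endif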
src\freelist\freelist_internal.h
- USE_ORIG_FREELIST – I had that in there just to make it easier to test the difference between the dcas_ptr freelist and the version 6 freelist.
- volatile isn’t needed on any of the freelist structure members. All “synchronization” is handled by dcas.
- I originally mistook “element_count” for a dynamic count, so I renamed it to “capacity”.
src\freelist\freelist_pop_push.c
- dcas_ptr versions of push/pop
- atomic increment doesn’t scale well at all on x86, so I eliminated it. The trade-off is that both pop and push now perform a local, non-atomic increment within the DCAS loop (see the sketch below). In the end, I think there was only a small improvement in scalability on my Core i7 920, but it may benefit other platforms even more.
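For illustration, a hedged sketch of the pop side of the idea; the types and the dcas helper here are stand-ins (implemented with a modern gcc builtin), not the patch’s actual code:

    #include <stddef.h>
    #include <stdint.h>

    /* stand-in types; the patch's real definitions differ */
    typedef struct dcas_ptr { void *ptr; uintptr_t counter; } dcas_ptr;
    struct freelist_element { struct freelist_element *next; void *user_data; };
    struct freelist_state   { dcas_ptr top; };

    /* stand-in dcas: compare-and-swap both words atomically; with gcc and
       -mcx16 the generic builtin maps to cmpxchg16b on x64, and on failure
       it refreshes *expected with the current value */
    static int dcas(dcas_ptr *target, dcas_ptr *expected, dcas_ptr *exchange)
    {
        return __atomic_compare_exchange(target, expected, exchange, 0,
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }

    struct freelist_element *freelist_pop_sketch(struct freelist_state *fs)
    {
        dcas_ptr snapshot = fs->top;  /* a torn read just makes the dcas fail */
        dcas_ptr next;

        do {
            if (snapshot.ptr == NULL)
                return NULL;                          /* freelist is empty */
            next.ptr     = ((struct freelist_element *)snapshot.ptr)->next;
            next.counter = snapshot.counter + 1;      /* local, non-atomic bump */
        } while (!dcas(&fs->top, &snapshot, &next));

        return (struct freelist_element *)snapshot.ptr;
    }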
test\src\abstraction.h
- Added some new abstractions:
- thread_condvar_t – This behaves just like a Win32 event: it can be signaled or non-signaled, and threads can wait, signal, and reset. The benchmark code uses this as a sort of barrier to ensure that all created threads are at the same point before beginning; signaling the condvar starts the benchmark. (A Posix sketch follows this list.)
- abstraction_thread_yield(ms) – The main benchmark thread uses this after creating threads, to allow them all to reach the “starting point”.
- thread_timer_t – I added this because I wanted more resolution than clock(). It uses QueryPerformanceCounter() on Windows and clock_gettime(CLOCK_REALTIME) under Posix.
- (These still need to be implemented for the DDK platform.)
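For illustration, a hedged Posix sketch of the condvar and the timer (function names are mine; the Windows versions would wrap CreateEvent/SetEvent/ResetEvent/WaitForSingleObject and QueryPerformanceCounter):

    #include <pthread.h>
    #include <time.h>

    /* event-style condvar: signaled/non-signaled state plus wait/signal/reset */
    typedef struct thread_condvar_t {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             signaled;
    } thread_condvar_t;

    void thread_condvar_init(thread_condvar_t *cv)
    {
        pthread_mutex_init(&cv->lock, NULL);
        pthread_cond_init(&cv->cond, NULL);
        cv->signaled = 0;
    }

    void thread_condvar_signal(thread_condvar_t *cv)
    {
        pthread_mutex_lock(&cv->lock);
        cv->signaled = 1;
        pthread_cond_broadcast(&cv->cond);   /* release all waiters at once */
        pthread_mutex_unlock(&cv->lock);
    }

    void thread_condvar_wait(thread_condvar_t *cv)
    {
        pthread_mutex_lock(&cv->lock);
        while (!cv->signaled)                /* guards against spurious wakeups */
            pthread_cond_wait(&cv->cond, &cv->lock);
        pthread_mutex_unlock(&cv->lock);
    }

    void thread_condvar_reset(thread_condvar_t *cv)
    {
        pthread_mutex_lock(&cv->lock);
        cv->signaled = 0;
        pthread_mutex_unlock(&cv->lock);
    }

    /* thread_timer_t counterpart: clock_gettime(CLOCK_REALTIME) gives far
       better resolution than clock(); needs -lrt on older glibc */
    double thread_timer_now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
    }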
test\src\abstraction_thread_start.c
- Added thread priority on Windows; added thread affinity and priority on Posix (had to run via sudo under Linux for the priority elevation). A sketch of the Posix side is below.
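For illustration, a hedged sketch of the Posix side (error handling omitted; the patch’s actual code may differ):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* sketch: called at the top of the new thread's entry function */
    static void pin_and_elevate_self(int cpu)
    {
        cpu_set_t cpus;
        struct sched_param sp;

        CPU_ZERO(&cpus);
        CPU_SET(cpu, &cpus);
        sched_setaffinity(0, sizeof(cpus), &cpus);   /* 0 = calling thread */

        /* SCHED_FIFO elevation is what needs root - hence the sudo */
        sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
        pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    }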
test\src\benchmark_freelist.c
- My attempt at gathering and calculating metrics. Let me know if you see any bugs, or if I can clarify what I was trying to do.
makefile.linux
- Removed abstraction_cas.c, abstraction_dcas.c, abstraction_increment.c
- Added abstraction_inlines.c
- Added to base: “-pedantic -D_XOPEN_SOURCE=600 -D_GNU_SOURCE” (_GNU_SOURCE is needed for sched_setaffinity)
- Added to release: “-O3 -Winline -DNDEBUG”
- Removed -s so I could get disassembly listings from gdb (to make sure stuff was getting inlined for real).
test\src\makefile.linux
- Added abstraction_timer.c, abstraction_thread_condvar.c
- Added “-pthread -lrt” (prefer -pthread over -lpthread; librt is needed for the timer stuff)
The VS project/solution currently only builds the libs, but all the project settings should be good. You can tweak makefile.windows by examining the command line the solution/project generates when compiling (particularly the optimization settings).
Here are some x64 benchmark results from my Core i7 920 box running Win7:
Release 0 Freelist Benchmark #1
1 Threads, 30000000 Operations, Times:
CPU 0 = 536.06 ms, 55964.07 ops/ms
Total Time = 536.09 ms
Avg Thread Time = 536.06 ms
Total Ops/ms = 55964.07 ops/ms
Scalability = 100.00%
Throughput = [Baseline]
2 Threads, 30000000 Operations, Times:
CPU 0 = 3391.45 ms, 8845.77 ops/ms
CPU 2 = 3353.54 ms, 8945.78 ops/ms
Total Time = 3391.47 ms
Avg Thread Time = 3372.49 ms
Total Ops/ms = 8895.50 ops/ms
Scalability = 7.95%
Throughput Drop = 15.90%
3 Threads, 30000000 Operations, Times:
CPU 0 = 6977.94 ms, 4299.26 ops/ms
CPU 2 = 6869.21 ms, 4367.31 ops/ms
CPU 4 = 6955.57 ms, 4313.09 ops/ms
Total Time = 6977.97 ms
Avg Thread Time = 6934.24 ms
Total Ops/ms = 4326.36 ops/ms
Scalability = 2.58%
Throughput Drop = 48.64%
4 Threads, 30000000 Operations, Times:
CPU 0 = 10240.62 ms, 2929.51 ops/ms
CPU 2 = 10237.61 ms, 2930.37 ops/ms
CPU 4 = 10229.84 ms, 2932.60 ops/ms
CPU 6 = 10200.86 ms, 2940.93 ops/ms
Total Time = 10240.64 ms
Avg Thread Time = 10227.23 ms
Total Ops/ms = 2933.34 ops/ms
Scalability = 1.31%
Throughput Drop = 67.80%
5 Threads, 30000000 Operations, Times:
CPU 0 = 10320.27 ms, 2906.90 ops/ms
CPU 1 = 9942.31 ms, 3017.41 ops/ms
CPU 2 = 9283.74 ms, 3231.46 ops/ms
CPU 4 = 9302.25 ms, 3225.03 ops/ms
CPU 6 = 9313.67 ms, 3221.07 ops/ms
Total Time = 10320.30 ms
Avg Thread Time = 9632.45 ms
Total Ops/ms = 3114.47 ops/ms
Scalability = 1.11%
Throughput Increase = 106.17% [6.17%]
6 Threads, 30000000 Operations, Times:
CPU 0 = 15157.39 ms, 1979.23 ops/ms
CPU 1 = 15151.46 ms, 1980.01 ops/ms
CPU 2 = 15161.49 ms, 1978.70 ops/ms
CPU 3 = 15167.51 ms, 1977.91 ops/ms
CPU 4 = 8447.68 ms, 3551.27 ops/ms
CPU 6 = 8491.80 ms, 3532.82 ops/ms
Total Time = 15167.53 ms
Avg Thread Time = 12929.55 ms
Total Ops/ms = 2320.27 ops/ms
Scalability = 0.69%
Throughput Drop = 74.50%
7 Threads, 30000000 Operations, Times:
CPU 0 = 19867.78 ms, 1509.98 ops/ms
CPU 1 = 19700.22 ms, 1522.83 ops/ms
CPU 2 = 20145.11 ms, 1489.20 ops/ms
CPU 3 = 20149.48 ms, 1488.87 ops/ms
CPU 4 = 20174.15 ms, 1487.05 ops/ms
CPU 5 = 20185.63 ms, 1486.21 ops/ms
CPU 6 = 7891.37 ms, 3801.62 ops/ms
Total Time = 20185.65 ms
Avg Thread Time = 18301.96 ms
Total Ops/ms = 1639.17 ops/ms
Scalability = 0.42%
Throughput Drop = 70.65%
8 Threads, 30000000 Operations, Times:
CPU 0 = 23025.25 ms, 1302.92 ops/ms
CPU 1 = 23645.38 ms, 1268.75 ops/ms
CPU 2 = 23955.23 ms, 1252.34 ops/ms
CPU 3 = 24038.93 ms, 1247.98 ops/ms
CPU 4 = 24069.25 ms, 1246.40 ops/ms
CPU 5 = 24072.82 ms, 1246.22 ops/ms
CPU 6 = 23702.26 ms, 1265.70 ops/ms
CPU 7 = 22637.75 ms, 1325.22 ops/ms
Total Time = 24072.83 ms
Avg Thread Time = 23643.36 ms
Total Ops/ms = 1268.86 ops/ms
Scalability = 0.28%
Throughput Drop = 77.41%