The diffs aren’t pretty. A lot of the code went through a “beautifier” (a macro I run out of habit).
I didn’t hack any further than the freelist code, and just the freelist benchmark on the test side. Here’s the important stuff:
ReadMe.txt
- LFDS doesn’t actually support IA64. _InterlockedCompareExchange128 isn’t available for IA64 (I don’t think it has a true dcas instruction). There are references to IA64 on your wiki too.
Inc\liblfds.h
- I wanted to consolidate all the “preprocessor know-how” into a single location, making it easier to recognize and implement the various ports. I came up with the following standard defines (a detection sketch follows the list):
Compiler (currently supported):
LIBLFDS_MSVC
LIBLFDS_GCC_UNIX - GCC on *nix (not Cygwin, not MinGW)
Environment (available headers, functions):
LIBLFDS_WINDOWS - PSDK or DDK environment
LIBLFDS_POSIX - Posix environment
Pointer Width:
LIBLFDS_32BIT
LIBLFDS_64BIT
Architecture (optional, add as needed):
LIBLFDS_X86
LIBLFDS_ARM
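For illustration, roughly how liblfds.h could map the compiler-provided macros onto this scheme (a hedged sketch; the patch’s actual detection logic may differ):

    /* sketch only - real detection in the patch may differ */
    #if defined(_MSC_VER)
      #define LIBLFDS_MSVC
      #define LIBLFDS_WINDOWS
      #if defined(_WIN64)
        #define LIBLFDS_64BIT
      #else
        #define LIBLFDS_32BIT
      #endif
    #elif defined(__GNUC__) && !defined(__CYGWIN__) && !defined(__MINGW32__)
      #define LIBLFDS_GCC_UNIX
      #define LIBLFDS_POSIX
      #if defined(__LP64__)
        #define LIBLFDS_64BIT
      #else
        #define LIBLFDS_32BIT
      #endif
    #endif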
- I added inline support for abstraction_[dcas|cas|increment] under gcc. Part of that included marking them extern here. (More on inlining under gcc below.)
- The FREELIST_COUNT_DCAS stuff was something I did for the freelist benchmark; it lets each thread get a count of its dcas retries.
src\liblfds_internal.h
- Added include guards to the headers that didn’t have them.
- I moved some things from liblfds.h to here that seemed like they didn’t need to be “publicly exposed” – ALIGN, INLINE, some headers.
- INLINE – This is used by abstraction_[dcas|cas|increment]. The trick here is to get a publicly exposed symbol that external modules can link to, while having the compiler inline any internal calls to those functions (within the lib/dll/so). Under MSVC this is easy via “extern __forceinline”. Under gcc, the easiest thing to do is to compile the inline functions as non-inline in their own translation unit, to be linked into the lib/dll/so; when the rest of the lib calls those functions, they’re defined inline. To achieve this, abstraction_cas.c, abstraction_dcas.c, and abstraction_increment.c are all #include’d at the end of liblfds_internal.h. There is a new source file, abstraction_inlines.c, which just includes liblfds_internal.h with INLINE disabled; it is only compiled/linked under gcc. (A sketch of the pattern follows the links below.)
Here’s the reference material I used for this:
http://gcc.gnu.org/gcc-4.3/porting_to.html
http://www.greenend.org.uk/rjk/2003/03/inline.html
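To make the pattern concrete, here’s a minimal sketch assuming C99 inline semantics; NO_INLINE is my name for the switch, the patch may spell it differently:

    /* liblfds_internal.h (sketch) */
    #if defined(LIBLFDS_MSVC)
      /* body is inlined internally, and 'extern' also emits a linkable symbol */
      #define INLINE extern __forceinline
    #elif defined(LIBLFDS_GCC_UNIX) && !defined(NO_INLINE)
      /* C99 'inline': an inline definition for this TU, no external symbol */
      #define INLINE inline
    #else
      /* plain definitions - this is what abstraction_inlines.c compiles */
      #define INLINE
    #endif

    /* bodies live at the end of the header so every internal caller
       sees inline definitions */
    #include "abstraction_cas.c"
    #include "abstraction_dcas.c"
    #include "abstraction_increment.c"

    /* abstraction_inlines.c (sketch, gcc only): re-compile the same bodies
       as ordinary functions to provide the out-of-line, linkable copies */
    #define NO_INLINE
    #include "liblfds_internal.h"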
- dcas_ptr – What I really wanted was a type I could throw anywhere and know it has correct alignment, without having to remember to use the ALIGN macro. This is what I came up with. The dcas_ptr_x() macros are an attempt to make typical usage a little more readable. Another advantage is that you can omit any packing on structures; the compiler just does the right thing, since it knows the alignment requirements of dcas_ptr. (A sketch follows.)
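For illustration, a hedged sketch of the kind of type I mean; the actual field names, alignment values, and dcas_ptr_x() macros in the patch may differ:

    /* the alignment rides along with the type, so no per-variable ALIGN
       macro is needed: 16 bytes for cmpxchg16b on 64-bit, 8 bytes for
       cmpxchg8b on 32-bit */
    #include <stdint.h>

    #if defined(LIBLFDS_64BIT)
      #define DCAS_ALIGNMENT 16
    #else
      #define DCAS_ALIGNMENT 8
    #endif

    #if defined(LIBLFDS_MSVC)
      typedef struct __declspec(align(DCAS_ALIGNMENT)) dcas_ptr {
        void      *ptr;      /* pointer half */
        uintptr_t  counter;  /* ABA counter half */
      } dcas_ptr;
    #else
      typedef struct dcas_ptr {
        void      *ptr;
        uintptr_t  counter;
      } __attribute__((aligned(DCAS_ALIGNMENT))) dcas_ptr;
    #endif

    /* illustrative accessors in the spirit of the dcas_ptr_x() macros */
    #define dcas_ptr_pointer(d)  ((d).ptr)
    #define dcas_ptr_count(d)    ((d).counter)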
src\abstraction\abstraction_[dcas|cas|increment].c
- Uses LIBLFDS_* directives.
- Added an #error if none of the ports within the file is selected for compilation (sketched below). This seemed easier than trying to make liblfds.h a catch-all for platforms that may not have everything ported to them.
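Something along these lines at the bottom of each file (a sketch; the exact message in the patch differs):

    /* abstraction_cas.c (sketch): fail the build loudly if no port matched */
    #if !defined(LIBLFDS_MSVC) && !defined(LIBLFDS_GCC_UNIX)
      #error abstraction_cas: no port selected for this compiler/platform
    #endif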
src\freelist\freelist_internal.h
- USE_ORIG_FREELIST – I had that in there just to make it easier to test the difference between the dcas_ptr freelist and the version 6 freelist.
- volatile isn’t needed on any of the freelist structure members. All “synchronization” is handled by dcas.
- I originally mistook “element_count” for a dynamic count, so I renamed it to “capacity”.
src\freelist\freelist_pop_push.c
- dcas_ptr versions of push/pop
- atomic increment doesn’t scale well at all on x86, so I eliminated it. The trade-off is that both pop and push now perform a local, non-atomic increment within the DCAS loop (see the sketch below). In the end, I think there was only a small improvement in scalability on my Core i7 920, but it may benefit other platforms even more.
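For illustration, a hedged sketch of the pop side of the idea; the types and the dcas helper here are stand-ins (implemented with a modern gcc builtin), not the patch’s actual code:

    #include <stddef.h>
    #include <stdint.h>

    /* stand-in types; the patch's real definitions differ */
    typedef struct dcas_ptr { void *ptr; uintptr_t counter; } dcas_ptr;
    struct freelist_element { struct freelist_element *next; void *user_data; };
    struct freelist_state   { dcas_ptr top; };

    /* stand-in dcas: compare-and-swap both words atomically; with gcc and
       -mcx16 the generic builtin maps to cmpxchg16b on x64, and on failure
       it refreshes *expected with the current value */
    static int dcas(dcas_ptr *target, dcas_ptr *expected, dcas_ptr *exchange)
    {
        return __atomic_compare_exchange(target, expected, exchange, 0,
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }

    struct freelist_element *freelist_pop_sketch(struct freelist_state *fs)
    {
        dcas_ptr snapshot = fs->top;  /* a torn read just makes the dcas fail */
        dcas_ptr next;

        do {
            if (snapshot.ptr == NULL)
                return NULL;                          /* freelist is empty */
            next.ptr     = ((struct freelist_element *)snapshot.ptr)->next;
            next.counter = snapshot.counter + 1;      /* local, non-atomic bump */
        } while (!dcas(&fs->top, &snapshot, &next));

        return (struct freelist_element *)snapshot.ptr;
    }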
test\src\abstraction.h
- Added some new abstractions:
- thread_condvar_t – This behaves just like a Win32 event: it can be signaled or non-signaled, and threads can wait, signal, and reset. The benchmark code uses this as a sort of barrier to ensure that all created threads are at the same point before beginning; signaling the condvar starts the benchmark. (A Posix sketch follows this list.)
- abstraction_thread_yield(ms) – The main benchmark thread uses this after creating threads, to allow them all to reach the “starting point”.
- thread_timer_t – I added this because I wanted more resolution than clock(). It uses QueryPerformanceCounter() on Windows and clock_gettime(CLOCK_REALTIME) under Posix.
- (These still need to be implemented for the DDK platform.)
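For illustration, a hedged Posix sketch of the condvar and the timer (function names are mine; the Windows versions would wrap CreateEvent/SetEvent/ResetEvent/WaitForSingleObject and QueryPerformanceCounter):

    #include <pthread.h>
    #include <time.h>

    /* event-style condvar: signaled/non-signaled state plus wait/signal/reset */
    typedef struct thread_condvar_t {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             signaled;
    } thread_condvar_t;

    void thread_condvar_init(thread_condvar_t *cv)
    {
        pthread_mutex_init(&cv->lock, NULL);
        pthread_cond_init(&cv->cond, NULL);
        cv->signaled = 0;
    }

    void thread_condvar_signal(thread_condvar_t *cv)
    {
        pthread_mutex_lock(&cv->lock);
        cv->signaled = 1;
        pthread_cond_broadcast(&cv->cond);   /* release all waiters at once */
        pthread_mutex_unlock(&cv->lock);
    }

    void thread_condvar_wait(thread_condvar_t *cv)
    {
        pthread_mutex_lock(&cv->lock);
        while (!cv->signaled)                /* guards against spurious wakeups */
            pthread_cond_wait(&cv->cond, &cv->lock);
        pthread_mutex_unlock(&cv->lock);
    }

    void thread_condvar_reset(thread_condvar_t *cv)
    {
        pthread_mutex_lock(&cv->lock);
        cv->signaled = 0;
        pthread_mutex_unlock(&cv->lock);
    }

    /* thread_timer_t counterpart: clock_gettime(CLOCK_REALTIME) gives far
       better resolution than clock(); needs -lrt on older glibc */
    double thread_timer_now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
    }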
test\src\abstraction_thread_start.c
- Added thread priority on Windows; added thread affinity and priority on Posix (had to run via sudo under Linux for the priority elevation). A sketch of the Posix side is below.
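For illustration, a hedged sketch of the Posix side (error handling omitted; the patch’s actual code may differ):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* sketch: called at the top of the new thread's entry function */
    static void pin_and_elevate_self(int cpu)
    {
        cpu_set_t cpus;
        struct sched_param sp;

        CPU_ZERO(&cpus);
        CPU_SET(cpu, &cpus);
        sched_setaffinity(0, sizeof(cpus), &cpus);   /* 0 = calling thread */

        /* SCHED_FIFO elevation is what needs root - hence the sudo */
        sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
        pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    }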
test\src\benchmark_freelist.c
- My attempt at gathering and calculating metrics. Let me know if you see any bugs, or if I can clarify what I was trying to do.
makefile.linux
- Removed abstraction_cas.c, abstraction_dcas.c, abstraction_increment.c
- Added abstraction_inlines.c
- Added to base: “-pedantic -D_XOPEN_SOURCE=600 -D_GNU_SOURCE” (_GNU_SOURCE is needed for sched_setaffinity)
- Added to release: “-O3 -Winline -DNDEBUG”
- Removed -s so I could get disassembly listings from gdb (to make sure stuff was getting inlined for real).
test\src\makefile.linux
- Added abstraction_timer.c, abstraction_thread_condvar.c
- Added “-pthread -lrt” (prefer -pthread over -lpthread; librt is needed for the timer stuff)
The VS project/solution currently only builds the libs, but all the project settings should be good. You can tweak makefile.windows by examining the command line the solution/project generates when compiling (particularly the optimization settings).
Here are some x64 benchmark results from my Core i7 920 box running Win7:
Release 0 Freelist Benchmark #1
1 Threads, 30000000 Operations, Times:
CPU 0 = 536.06 ms, 55964.07 ops/ms
Total Time = 536.09 ms
Avg Thread Time = 536.06 ms
Total Ops/ms = 55964.07 ops/ms
Scalability = 100.00%
Throughput = [Baseline]
2 Threads, 30000000 Operations, Times:
CPU 0 = 3391.45 ms, 8845.77 ops/ms
CPU 2 = 3353.54 ms, 8945.78 ops/ms
Total Time = 3391.47 ms
Avg Thread Time = 3372.49 ms
Total Ops/ms = 8895.50 ops/ms
Scalability = 7.95%
Throughput Drop = 15.90%
3 Threads, 30000000 Operations, Times:
CPU 0 = 6977.94 ms, 4299.26 ops/ms
CPU 2 = 6869.21 ms, 4367.31 ops/ms
CPU 4 = 6955.57 ms, 4313.09 ops/ms
Total Time = 6977.97 ms
Avg Thread Time = 6934.24 ms
Total Ops/ms = 4326.36 ops/ms
Scalability = 2.58%
Throughput Drop = 48.64%
4 Threads, 30000000 Operations, Times:
CPU 0 = 10240.62 ms, 2929.51 ops/ms
CPU 2 = 10237.61 ms, 2930.37 ops/ms
CPU 4 = 10229.84 ms, 2932.60 ops/ms
CPU 6 = 10200.86 ms, 2940.93 ops/ms
Total Time = 10240.64 ms
Avg Thread Time = 10227.23 ms
Total Ops/ms = 2933.34 ops/ms
Scalability = 1.31%
Throughput Drop = 67.80%
5 Threads, 30000000 Operations, Times:
CPU 0 = 10320.27 ms, 2906.90 ops/ms
CPU 1 = 9942.31 ms, 3017.41 ops/ms
CPU 2 = 9283.74 ms, 3231.46 ops/ms
CPU 4 = 9302.25 ms, 3225.03 ops/ms
CPU 6 = 9313.67 ms, 3221.07 ops/ms
Total Time = 10320.30 ms
Avg Thread Time = 9632.45 ms
Total Ops/ms = 3114.47 ops/ms
Scalability = 1.11%
Throughput Increase = 106.17% [6.17%]
6 Threads, 30000000 Operations, Times:
CPU 0 = 15157.39 ms, 1979.23 ops/ms
CPU 1 = 15151.46 ms, 1980.01 ops/ms
CPU 2 = 15161.49 ms, 1978.70 ops/ms
CPU 3 = 15167.51 ms, 1977.91 ops/ms
CPU 4 = 8447.68 ms, 3551.27 ops/ms
CPU 6 = 8491.80 ms, 3532.82 ops/ms
Total Time = 15167.53 ms
Avg Thread Time = 12929.55 ms
Total Ops/ms = 2320.27 ops/ms
Scalability = 0.69%
Throughput Drop = 74.50%
7 Threads, 30000000 Operations, Times:
CPU 0 = 19867.78 ms, 1509.98 ops/ms
CPU 1 = 19700.22 ms, 1522.83 ops/ms
CPU 2 = 20145.11 ms, 1489.20 ops/ms
CPU 3 = 20149.48 ms, 1488.87 ops/ms
CPU 4 = 20174.15 ms, 1487.05 ops/ms
CPU 5 = 20185.63 ms, 1486.21 ops/ms
CPU 6 = 7891.37 ms, 3801.62 ops/ms
Total Time = 20185.65 ms
Avg Thread Time = 18301.96 ms
Total Ops/ms = 1639.17 ops/ms
Scalability = 0.42%
Throughput Drop = 70.65%
8 Threads, 30000000 Operations, Times:
CPU 0 = 23025.25 ms, 1302.92 ops/ms
CPU 1 = 23645.38 ms, 1268.75 ops/ms
CPU 2 = 23955.23 ms, 1252.34 ops/ms
CPU 3 = 24038.93 ms, 1247.98 ops/ms
CPU 4 = 24069.25 ms, 1246.40 ops/ms
CPU 5 = 24072.82 ms, 1246.22 ops/ms
CPU 6 = 23702.26 ms, 1265.70 ops/ms
CPU 7 = 22637.75 ms, 1325.22 ops/ms
Total Time = 24072.83 ms
Avg Thread Time = 23643.36 ms
Total Ops/ms = 1268.86 ops/ms
Scalability = 0.28%
Throughput Drop = 77.41%