# Thread: An exercise in optimization

1. ## An exercise in optimization

Okay, I'm bored so I'll give whoever is interested something to chew on. Whoever has the fastest practical code is the winner. That means that inline assembly will be a mark against you, as will increasing the size and complexity of the algorithm considerably (to restrict you stackers a bit ).

You've been given the task of optimizing a quicksort routine written by one of your lovely colleagues who is currently on vacation. The optimized code needs to be integrated before she returns, so you've been given the job of making her code faster. The sort has been profiled and determined to be too slow with the recent influx of new users giving the software considerably more data to work with. The routine is used extensively in the software, so you can't change the interface.

Because the maintenance programmers are typically junior engineers with little or no experience, you're forced to hold yourself back a little. The resulting code must be fast as possible without causing the maintenance people to have a heart attack trying to understand it.

At current, the code runs approximately n% slower than the standard C library function qsort. Sadly, the implementation of qsort on your machines interacts poorly with your software and cannot be used in place of the custom sort. Because your lovely colleague was so careful in her original implementation, you find a handy scaffolding program to help you out. Here it is:
Code:
```#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_BUF 50
#define CUTOFF  18

typedef int (*Comparable) ( const void *a, const void *b );

void jw_insertion_sort ( void *base, int size, int n, Comparable cmp )
{
unsigned char save[MAX_BUF];
unsigned char *cbase = base;
int i, j;

for ( i = 1; i < n; i++ ) {
memcpy ( save, cbase + ( i * size ), size );
for ( j = i; j > 0 && cmp ( cbase + ( ( j - 1 ) * size ), save ) > 0; j-- )
memcpy ( cbase + ( j * size ), cbase + ( ( j - 1 ) * size ), size );
memcpy ( cbase + ( j * size ), save, size );
}
}

void swap_block ( void *a, void *b, int size )
{
unsigned char save_block[MAX_BUF];

memcpy ( save_block, a, size );
memcpy ( a, b, size );
memcpy ( b, save_block, size );
}

void jw_quick_sort ( void *base, int size, int low, int high, Comparable cmp )
{
unsigned char save[MAX_BUF];
unsigned char *cbase = base;
int left = low, right = high + 1;
int r;

if ( high - low < CUTOFF )
return;
r = (int) ( ( (double)rand() * ( high - low ) ) / (double)RAND_MAX + low );
swap_block ( cbase + ( low * size ), cbase + ( r * size ), size );
memcpy ( save, cbase + ( low * size ), size );
for ( ; ; ) {
while ( ++left <= high && cmp ( cbase + ( left * size ), save ) < 0 )
;
while ( --right >= 0 && cmp ( cbase + ( right * size ), save ) > 0 )
;
if ( left > right )
break;
swap_block ( cbase + ( left * size ), cbase + ( right * size ), size );
}
swap_block ( cbase + ( low * size ), cbase + ( right * size ), size );
jw_quick_sort ( base, size, low, right - 1, cmp );
jw_quick_sort ( base, size, right + 1, high, cmp );
}

void jw_sort ( void *base, int size, int n, Comparable cmp )
{
jw_quick_sort ( base, size, 0, n - 1, cmp );
jw_insertion_sort ( base, size, n, cmp );
}

int compare ( const void *a, const void *b )
{
int *aa = (int *)a;
int *bb = (int *)b;

if ( *aa < *bb )
return -1;
else if ( *aa > *bb )
return +1;
else
return 0;
}

#include <time.h>

int main ( void )
{
int a0;
int *a1 = malloc ( 1000000 * sizeof ( int ) );
int i, n;
clock_t start;
double jw_time, q_time;

for ( i = 0; i < 10; i++ )
a0[i] = rand();
for ( i = 0; i < 10; i++ )
printf ( "%d ", a0[i] );
printf ( "\n" );
jw_sort ( a0, sizeof ( int ), 10, compare );
for ( i = 0; i < 10; i++ )
printf ( "%d ", a0[i] );
printf ( "\n" );
for ( n = 0; n < 10; n++ ) {
for ( i = 0; i < 1000000; i++ )
a1[i] = rand();
start = clock();
jw_sort ( a1, sizeof ( int ), 1000000, compare );
jw_time = ( (double)clock() - start ) / CLOCKS_PER_SEC;
printf ( "jw_sort: %f  ", jw_time );
for ( i = 0; i < 1000000; i++ )
a1[i] = rand();
start = clock();
qsort ( a1, 1000000, sizeof ( int ), compare );
q_time = ( (double)clock() - start ) / CLOCKS_PER_SEC;
printf ( "qsort: %f  Difference: %.0f%%\n", q_time, ( 1.0 - ( q_time / jw_time ) ) * 100 );
}
free ( a1 );

return 0;
}```
The scaffolding reinforces what you learned while profiling the code, jw_sort runs about n% slower than qsort. At the time that the code was written this was an acceptable difference so the author chose not to complicate the functions any more.

Use any means you want to make jw_sort as fast as possible without causing the poor maintenance programmers fits. The data that the software processes can be anything from single characters to large heterogeneous records, so a generalized routine is required. Unfortunately, your colleague trusts that her co-workers are not stupid and is sparse with her commenting. You have to figure out her algorithm on your own. Since there aren't any comments, it shouldn't be too unusual of a solution you believe.

Back to reality

This is a reasonably realistic situation, which is a departure from most of our contests. But you'll find yourself learning quite a bit about quicksort by optimizing a typical implementation that you may have to decipher first. This is not a closed entry contest, so as soon as you have a submission, feel free to post it for everyone to see. This promotes multiple submissions and a high level of competitiveness since your opponents will be able to use your ideas in their own solutions. There is no time limit. Your colleague takes long vacations. Enjoy!  2. wow, I can't even imagine. Good luck people who attempt this feat. 3. Hmm... I think I might try to do this a bit later to see what I can come up with. Are you allowed to optimise the main() routine as well?

Edit: Err... nm. Now that I looked at it closely, optimising main() wouldn't make any difference. 4. >Are you allowed to optimise the main() routine as well?
As long as the interface to the function isn't changed (because it is already in use), you can optimize whatever you want. And there is one big improvement that effects speed poorly if you choose to implement it.  5. is this still going on? 6. >is this still going on?
Yes. 7. Originally Posted by Prelude
Because the maintenance programmers are typically junior engineers with little or no experience
okay, that would be me, except i don't understant it at all what exactly does this do? 8. I think Prelude was just writing a program and messed up and wants us to clean it up.

The lazy little... :P

Just kidding ;-) 9. Okay - the ball is in your court I don't claim much except some minor changes in open source code, but it is better than our native qsort by along shot, so it is part of our run-time.

I get:
kcsdev:/home/jmcnama> cc isort.c -o isort +O4
kcsdev:/home/jmcnama> isort
16838 5758 10113 17515 31051 5627 23010 7419 16212 4086
4086 5627 5758 7419 10113 16212 16838 17515 23010 31051
jw_sort: 1.580000 qsort: 5.580000 Difference: -253%
jw_sort: 1.550000 qsort: 5.460000 Difference: -252%
jw_sort: 1.600000 qsort: 5.590000 Difference: -249%
jw_sort: 1.530000 qsort: 5.420000 Difference: -254%
jw_sort: 1.570000 qsort: 5.490000 Difference: -250%
jw_sort: 1.510000 qsort: 5.570000 Difference: -269%
jw_sort: 1.580000 qsort: 5.450000 Difference: -245%
jw_sort: 1.560000 qsort: 5.530000 Difference: -254%
jw_sort: 1.570000 qsort: 5.380000 Difference: -243%
jw_sort: 1.530000 qsort: 5.640000 Difference: -269%
Code we modified over time, based on source from OSF and
G Schmidt.
Code:
```Profiler output:
%Time Seconds Cumsecs  #Calls   msec/call  Name

40.8   86.35   86.35421060654        0.00  compare```
almost all of the time is spent in compare - 40.8 %

Code:
```#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_BUF 64
#define CUTOFF  18

#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SWAP(a, b, size)		    \
do							    \
{							    \
register size_t __size = (size);			\
register char *__a = (a), *__b = (b);		\
do								        \
{								            \
char __tmp = *__a;						\
*__a++ = *__b;						    \
*__b++ = __tmp;						    \
} while (--__size > 0);						\
} while (0)

#define INSERT_THRESH 4

typedef struct
{
char *lo;
char *hi;
} stack_node;

#define min(x, y) ((x) < (y) ? (x) : (y))
#define STACK_SIZE	(CHAR_BIT * sizeof(size_t))
#define PUSH(low, high)	((void) ((top->lo = (low)), (top->hi = (high)), ++top))
#define	POP(low, high)	((void) (--top, (low = top->lo), (high = top->hi)))
#define	STACK_NOT_EMPTY	(stack < top)

/*

PNMquicksort() libc.sl.2.04.a   jmc
This incorporates improvements and
optimizations (see Sedgwick, 1997)

Non-recursive, uses stack of pointers

Quicksorts just TOTAL_ELEMS / INSERT_THRESH partitions, falls into
insertion sort to order the INSERT_THRESH items within each partition.

The larger of the two sub-partitions is always pushed onto the
stack first.

Based on code from OSF 1990 and GNU (G. Schmidt).

GNU disclaimer (aka warranty) which is required to be in source derived from
GNU:

The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
version 1.1 of the License, or (at your option) any later version.

The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
Lesser General Public License for more details.

*/

typedef int (*Comparable) ( const void *a, const void *b );

void
PNMquicksort (void *const pbase,
size_t total_elems,
size_t size,
Comparable cmp_fx)
{
register char *base_ptr = (char *) pbase;
const size_t max_thresh = INSERT_THRESH * size;

if (total_elems == 0)
{
return;
}

if (total_elems > INSERT_THRESH)
{
char *lo = base_ptr;
char *hi = &lo[size * (total_elems - 1)];
stack_node stack[STACK_SIZE];
stack_node *top = stack + 1;

while (STACK_NOT_EMPTY)
{
char *left;
char *right;
char *mid = lo + size * ((hi - lo) / size >> 1);

if ((*cmp_fx) ((void *) mid, (void *) lo) < 0)
SWAP (mid, lo, size);
if ((*cmp_fx) ((void *) hi, (void *) mid) < 0)
SWAP (mid, hi, size);
else
goto skip;                            /* ugh a goto ! */
if ((*cmp_fx) ((void *) mid, (void *) lo) < 0)
SWAP (mid, lo, size);
skip:;
left  = lo + size;
right = hi - size;

do
{
while ((*cmp_fx) ((void *) left, (void *) mid) < 0)
{
left += size;
}
while ((*cmp_fx) ((void *) mid, (void *) right) < 0)
{
right -= size;
}
if (left < right)
{
SWAP (left, right, size);
if (mid == left)
mid = right;
else if (mid == right)
mid = left;
left += size;
right -= size;
}
else if (left == right)
{
left += size;
right -= size;
break;
}
}  while (left <= right);

if ((size_t) (right - lo) <= max_thresh)
{
if ((size_t) (hi - left) <= max_thresh)
POP (lo, hi);
else
lo = left;
}
else if ((size_t) (hi - left) <= max_thresh)
hi = right;
else if ((right - lo) > (hi - left))
{
PUSH (lo, right);
lo = left;
}
else
{
PUSH (left, hi);
hi = right;
}
}
}

{ /* insertion sort */
char *const end_ptr = &base_ptr[size * (total_elems - 1)];
char *tmp_ptr = base_ptr;
char *thresh = min(end_ptr, base_ptr + max_thresh);
register char *run_ptr;

for (run_ptr = tmp_ptr + size; run_ptr <= thresh; run_ptr += size)
{
if ((*cmp_fx) ((void *) run_ptr, (void *) tmp_ptr) < 0)
{
tmp_ptr = run_ptr;
}
}

if (tmp_ptr != base_ptr)
{
SWAP (tmp_ptr, base_ptr, size);
}

run_ptr = base_ptr + size;
while ((run_ptr += size) <= end_ptr)
{
tmp_ptr = run_ptr - size;
while ((*cmp_fx) ((void *) run_ptr, (void *) tmp_ptr) < 0)
tmp_ptr -= size;

tmp_ptr += size;
if (tmp_ptr != run_ptr)
{
char *trav;

trav = run_ptr + size;
while (--trav >= run_ptr)
{
char c = *trav;
char *hi, *lo;

for (hi = lo = trav; (lo -= size) >= tmp_ptr; hi = lo)
{
*hi = *lo;
}
*hi = c;
}
}
}
}
}

void jw_sort ( void *base, const size_t total_elems, Comparable cmp )
{
PNMquicksort ( base, total_elems, sizeof(int), cmp );
}
/* improved compare */

int compare ( const void *a, const void *b )
{
return  *(int *)a - *(int *)b;
}

/* old slower compare */

int slow_compare ( const void *a, const void *b )
{
int *aa = (int *)a;
int *bb = (int *)b;

if ( *aa < *bb )
return -1;
else if ( *aa > *bb )
return +1;
else
return 0;
}

#include <time.h>

int main ( void )
{
int a0={0};
int *a1 = malloc ( 1000000 * sizeof ( int ) );
int i=0,
n=0;
clock_t start=0;
double jw_time=0.,
q_time=0.;

for ( i = 0; i < 10; i++ )
{
a0[i] = rand();
}
for ( i = 0; i < 10; i++ )
{
printf ( "%d ", a0[i] );
}
printf ( "\n" );
jw_sort ( a0, 10, compare );
for ( i = 0; i < 10; i++ )
{
printf ( "%d ", a0[i] );
}
printf ( "\n" );
for ( n = 0; n < 10; n++ )
{
for ( i = 0; i < 1000000; i++ )
{
a1[i] = rand();
}
start = clock();
jw_sort ( a1, 1000000, compare );
jw_time = ( (double)clock() - start ) / CLOCKS_PER_SEC;
printf ( "jw_sort: %f  ", jw_time );
for ( i = 0; i < 1000000; i++ )
{
a1[i] = rand();
}
start = clock();
qsort ( a1, 1000000, sizeof ( int ),compare );
q_time = ( (double)clock() - start ) / CLOCKS_PER_SEC;
printf ( "qsort: %f  Difference: %.0f%%\n",
q_time,
( 1.0 - ( q_time / jw_time ) ) * 100 );

}
free ( a1 );
return 0;
}``` 10. Man...it took a year for someone to crank out an answer? Forget it; I'm not entering any of Prelude's contests. That's some dedication. 11. jim mcnamara you've been here long enough to know you shouldn't bump year old threads. Popular pages Recent additions 