1. ## optimizing bubble sort

Hello everyone,

at first I didn't care about the speed of my bubble sort function - a working program was/is enough ... but today I found out that some newbie was able to write a faster bubble sort function than me. So, I decided to tweak the sourcecode a little bit.
I tried to reach the performance of the bubble sort function (sourcecode still undisclosed) written by one of our tutors at my university. My function is still about 30% slower than his and I don't know why.
Here is my current sourcecode - I hope you can help me to make it faster and find possible bugs (I guess there are some but I am too tired at the moment to think clearly):
Code:
```#define N 10240
...
int main()
{
int array[N];
...
return 0;
}
...
{
// strange but register makes a difference on cygwin gcc 3.2
register int i, tmp, tmp1, test, *pointer;

counter = N-2;

do
{
test = 1;
i = 0;
pointer = array;

do
{
// new code ~25% faster
if (*pointer > *(pointer+1))
{
tmp = *pointer;
*pointer = *(pointer+1);
*(pointer+1) = tmp;
test = 0;
}
// old code -> slow
/*if (pointer[i] > pointer[i+1])
{
tmp = pointer[i+1];
pointer[i+1] = pointer[i];
pointer[i] = tmp;
test = 1;
}*/
++pointer;

//	i++ and counter = N-2 is faster than ++i and counter = N-1
// 	I guess I am making something wrong here ... though the array is sorted
//	correctly ...
}while(i++ < counter);

if(test) return 1;

}while(--counter >= -1);

return 1;
}```

2. You really shouldn't be wasting your time optimizing bubble
sort Are you even using -O2 with gcc? I don't think
you would have the problem with register if you did.

This is probably fairly fast on almost
sorted input.

Code:
```void bubble_sort(int A[], int n)
{
int i, j;
int clean_pass = 0;

for (i = 0; !clean_pass && i < n - 1; ++i) {
clean_pass = 1;
for (j = i + 1; j < n; ++j) {
if (A[i] > A[j]) {
int t = A[i];
A[i] = A[j];
A[j] = t;
clean_pass = 0;
}

}
}
}```

3. Thank you for the sourcecode, Nick. You're right, the code is fast, but mine is about 10% faster. I don't know what my tutor could've done to make the C code fly without using compiler optimizations or a faster computer ... and I've yet to find a tutor at my university who knows more about C and sourcecode optimizations than me ... but maybe I've found my master?
Of course, compiler optimizations can make the program fly but I wanted to know how much more speed is possible without using them.

4. Originally posted by Sargnagel
Thank you for the sourcecode, Nick. You're right, the code is fast, but mine is about 10% faster. I don't know what my tutor could've done to make the C code fly without using compiler optimizations or a faster computer ... and I've yet to find a tutor at my university who knows more about C and sourcecode optimizations than me ... but maybe I've found my master?
Of course, compiler optimizations can make the program fly but I wanted to know how much more speed is possible without using them.

I tried once to optimize bubblesort but the speed depends on the compiler.

You have to turn on optimization otherwise a
if(a<b)
{
t=a;
a=b;
b=a;
}

is in assembler a
cmp ; compare both values
mov ; move both values in register
mov

mov
mov
mov ; switch both values

but with optimizations you may get
cmp ; compare both values
xchg ; change both variables

oder xchg is subsituted with
mov
mov
because the values already were in registers because of the cmp

you have to look at the assembler code taht your compiler produce, otherwise you don't have a chanze.

5. Thanks alot, Shade! I will have a look at the assembler code ... but first I will have to learn assembler (hello, rafe ).

I will check out some other compilers, too.

... yes, it is a waste of time optimizing bubble sort but on the other hand it is a nice little practice.

6. Lessons in premature optimisation
Code:
```#include <stdio.h>
#include <stdlib.h>

// no -DN=nnn on command line
#ifndef N
#define N 10240
#define NO_DUMP
#endif

// read pentium fast clock with gcc3.2
#define RDTSC(llptr) { \
__asm__ __volatile__ ( \
"rdtsc" \
: "=A" (llptr) \
); }

int qsort_cmp ( const void *a, const void *b ) {
const int *pa = a;
const int *pb = b;
if ( *pa > *pb ) return +1;
if ( *pa < *pb ) return -1;
return 0;
}

// simple bubble sort with no attempt at optimisation
// to act as reference for improvement comparisions
void bubble_reference ( int arr[], int len ) {
int sorted, scan;
for ( sorted = len-1 ; sorted >= 0 ; sorted-- ) {
for ( scan = 0 ; scan < sorted ; scan++ ) {
if ( arr[scan+1] < arr[scan] ) {
int temp = arr[scan+1];
arr[scan+1] = arr[scan];
arr[scan] = temp;
}
}
}
}

// exchange unsorted elements with the end element
// not strictly a bubble sort anymore (I don't think)
void bubble_salem ( int arr[], int len ) {
int sorted, scan;
for ( sorted = len-1 ; sorted >= 0 ; sorted-- ) {
for ( scan = 0 ; scan < sorted ; scan++ ) {
if ( arr[sorted] < arr[scan] ) {
int temp = arr[sorted];
arr[sorted] = arr[scan];
arr[scan] = temp;
}
}
}
}

void bubble_sort_nick (int A[], int n) {
int i, j;
int clean_pass = 0;

for (i = 0; !clean_pass && i < n - 1; ++i) {
clean_pass = 1;
for (j = i + 1; j < n; ++j) {
if (A[i] > A[j]) {
int t = A[i];
A[i] = A[j];
A[j] = t;
clean_pass = 0;
}
}
}
}

int adv_bubble(int array[], int len) {
register int i, tmp, tmp1, test, *pointer;
int counter = len-2;
do {
test = 1;
i = 0;
pointer = array;

do {
// new code ~25% faster
if(*pointer > *(pointer+1)) {
tmp = *pointer;
*pointer = *(pointer+1);
*(pointer+1) = tmp;
test = 0;
}
// old code -> slow
/*if (pointer[i] > pointer[i+1])
{
tmp = pointer[i+1];
pointer[i+1] = pointer[i];
pointer[i] = tmp;
test = 1;
}*/
++pointer;

//i++ and counter = N-2 is faster than ++i and counter = N-1
// I guess I am making something wrong here ... though the array is sorted
//correctly ...
} while(i++ < counter);

if ( test ) return 1;
} while(--counter >= -1);
return 1;
}

int main ( ) {
int array0[N], array1[N],array2[N],array3[N],array4[N],array5[N];
unsigned long long start, end;
unsigned long long t1, t2, t3, t4, t5;
int i;

// prepare 5 identical data sets
for ( i = 0 ; i < N ; i++ ) array0[i] = rand() % N;
memcpy( array1, array0, sizeof(array0) );
memcpy( array2, array0, sizeof(array0) );
memcpy( array3, array0, sizeof(array0) );
memcpy( array4, array0, sizeof(array0) );
memcpy( array5, array0, sizeof(array0) );

RDTSC(start);
qsort( array1, N, sizeof(array1[0]), qsort_cmp );
RDTSC(end); t1 = end-start;

RDTSC(start);
bubble_reference( array2, N );
RDTSC(end); t2 = end-start;

RDTSC(start);
bubble_sort_nick( array3, N );
RDTSC(end); t3 = end-start;

RDTSC(start);
RDTSC(end); t4 = end-start;

RDTSC(start);
bubble_salem( array5, N );
RDTSC(end); t5 = end-start;

/* check results */
#ifndef NO_DUMP
/* if a value for N was specified on the command line, print everything */
for ( i = 0 ; i < N ; i++ )
printf( "%4d %4d %4d %4d %4d %4d\n",
array0[i], array1[i], array2[i], array3[i], array4[i], array5[i] );
#endif
for ( i = 0 ; i < N ; i++ ) {
if ( array1[i] != array2[i] ) printf( "Array2 diff at %d\n", i );
if ( array1[i] != array3[i] ) printf( "Array3 diff at %d\n", i );
if ( array1[i] != array4[i] ) printf( "Array4 diff at %d\n", i );
if ( array1[i] != array5[i] ) printf( "Array5 diff at %d\n", i );
}

printf( "Qsort            time=%lld\n", t1 );
printf( "bubble_reference time=%lld\n", t2 );
printf( "bubble_sort_nick time=%lld\n", t3 );
printf( "bubble_salem     time=%lld\n", t5 );

return 0;
}```
I get these results
Code:
```E:\code>gcc prog.c
E:\code>a.exe
Qsort            time=8190385
bubble_reference time=1738486896
bubble_sort_nick time=1451780046
bubble_salem     time=1109674714

E:\code>gcc -O2 prog.c
E:\code>a.exe
Qsort            time=7998630
bubble_reference time=614753262
bubble_sort_nick time=658336716
bubble_salem     time=416393171```
Surprise news!!!!
When the optimiser is turned on, the reference implementation beats the two attempts to improve things.

To Sargnagel
I wonder if your tutor has implemented bubble_salem() and not a true bubble sort?

7. Why are you guys even bothering with bubblesort? Optimizing bubblesort is an exercise in futility, there's no way to improve the algorithm beyond quadratic performance and still have bubblesort. ;-)

8. @Salem:

Woohooo!
*bows before the master*
Surprise news!!!!
When the optimiser is turned on, the reference implementation beats the two attempts to improve things.
That's indeed a surprise ... even more surprising to me is, that both improved functions perform about equally "bad" when 'gcc -O3' is used. And even your function runs a little slower with O3.
I wonder if your tutor has implemented bubble_salem() and not a true bubble sort?
Hehe ... would've been quite evil.
I managed to check my new function a few hours ago at a Sun UltraSparc workstation at my university. The program runs about 20% faster than my tutor's version.
Finally, everything is alright again ... at least for me.

Thanks to everyone for helping me out.

Cela
Why are you guys even bothering with bubblesort? Optimizing bubblesort is an exercise in futility, there's no way to improve the algorithm beyond quadratic performance and still have bubblesort. ;-)
It was just ment as a little excercise. It's truly pointless for a real world application (I guess). I just felt I had to write a faster function than my tutor ... and I succeeded.

9. Just another "little" excercise

Here is the current version of my bubble sort function:
Code:
```#define N 10240

{
register int counter = N -2, *start = array;

do
{
register unsigned int test = 1, i = 0;
array = start;

do
{
if(*array > *(array+1))
{
register int tmp = *array;
*array = *(array+1);
*(array+1) = tmp;
test = 0;
}
++feld;

}while(i++ < counter);

if(test) return 1;

}while(counter--);

return 1;
}```
What about the while condition of the inner loop? Can it be optimized? I mean, can it be replaced by a more exotic expression? Something that's faster than an addition? It would surely help alot.

10. Well you could unroll the inner loop, so that you do more sorting and less comparing, something akin to duff's device for fast memory copies.

You'll probably need to write the compare and swap part as a macro, to make the actual loop somewhat readable, because you'll end up with say 8 copies of the same code!

11. And again thanks for your help, Salem. I will try to implement Duff's Device over the next few days.

12. hehe ... okay ... I couldn't wait any longer ... here is my first try on Duff's Device:
Code:
```#define N 10240

#define COMP_SWAP(x,y)	if(*x > *(y+1))\
{\
tmp = *x;\
*x = *(y+1);\
*(y+1) = tmp;\
test = 0;\
}\
++x;

{
int counter = N-1;
unsigned int *start = array;

do
{
register unsigned int test = 1, tmp;
array = start;

register int n=(counter+7)/8;
switch(counter%8)
{
case 0:	do{	COMP_SWAP(array,array)
case 7:		COMP_SWAP(array,array)
case 6:		COMP_SWAP(array,array)
case 5:		COMP_SWAP(array,array)
case 4:		COMP_SWAP(array,array)
case 3:		COMP_SWAP(array,array)
case 2:		COMP_SWAP(array,array)
case 1:		COMP_SWAP(array,array)
}while(n--);
}

if(test) return 1;

}while(counter--);

return 1;
}```
It's a little bit slower than my previous version. There is still some overhead that needs to be reduced ... but I don't know how I could merge the incrementation (++)of the pointer array into the swap part to avoid one extra +1 addition. But if the if-clause is false, the pointer array must be incremented, too. With if ... else it becomes too slow. Is there another way?

I've replaced the adv_bubble function in your program, Salem, with this new version to measure the performance correctly. Here are the results:
• no gcc optimizations:
Qsort time =6107803
bubble_reference time=1362174983
bubble_sort_nick time =1155297212
duff_bubble time =1061755064
bubble_salem time =817406709

with gcc -O2:
Qsort time =5352142
bubble_reference time=446172591
bubble_sort_nick time =493390219
duff_bubble time =401223056
bubble_salem time =318042164

The Duff's Device source code can obviously be much better optimized by cygwin gcc 3.2.

Another problem:
When I use the code provided by my tutor to measure the time of the calculation with the Duff's Device BubbleSort function (yes, the field is sorted correctly), the displayed floating point value is very, very large (i.e. 875246141433.192017 ms) Here is the code:
Code:
```#include <sys/time.h>

double sub_time( struct timeval start,struct timeval stop)
{
double  t1,t2;

t1=(double) start.tv_usec;
t1/=1000000;

t1+=start.tv_sec;
t2=stop.tv_usec;
t2/=1000000;
t2+=stop.tv_sec;

return (t2-t1)*1000;
}

int main()
{
struct timeval start,stop;
...
gettimeofday(&start,NULL);

gettimeofday(&stop,NULL);

printf("time: %f\n", sub_time(start,stop));
...
return 0;
}```

13. Code:
```register int n=(counter+7)>>3;
switch(counter%8)
{
...```
I've replaced the division by 8 with a right shift operation that does the same but is a little bit faster.
I'm currently trying to improve the swap part. I will edit this post, when I've been successful.

Here you are:
Code:
```#define N 10240

#define COMP_SWAP(x,y)	if(*x > *y)\
{\
tmp = *x;\
*x = *y;\
*y = tmp;\
test = 0;\
}\
x = y++;

int duff_bubble(int array[])
{
int counter = N-1;
unsigned int *start = array;

do
{
register unsigned int test = 1, tmp, *next = start+1;
array = start;

register int n=(counter+7)>>3;
switch(counter%8)
{
case 0:	do{	COMP_SWAP(array,next)
case 7:		COMP_SWAP(array,next)
case 6:		COMP_SWAP(array,next)
case 5:		COMP_SWAP(array,next)
case 4:		COMP_SWAP(array,next)
case 3:		COMP_SWAP(array,next)
case 2:		COMP_SWAP(array,next)
case 1:		COMP_SWAP(array,next)
}while(--n);
}

if(test) return 1;

}while(counter--);

return 1;
}```
I've modified the COMP_SWAP macro to reduce the number of pointer increments. I've introduced another pointer called 'next'. This pointer points at the element 'start+1'. So, no more increments are needed in the swap part. And afterwards I only need one address assignment and one increment.
First tests show that this new function is maybe a little faster than my adv_bubble_sort function (no Duff's device). But I will have to modify Salem's program first to be absolutely sure.
Although the compiler cannot optimize it (cygwin gcc -O2) as good as my previous Duff's device sort.

Here are the results:
• cygwin gcc 3.2
Qsort time=7252293
bubble_reference time=1341713751
bubble_sort_nick time=1149134359
duff_bubble time=1034310584
bubble_salem time=802085318

cygwin gcc 3.2 -O2
Qsort time=5210652
bubble_reference time=441168372
bubble_sort_nick time=488427096
duff_bubble time=425535524
bubble_salem time=315937090

Sadly, duff_bubble is unoptimized still a little bit slower than adv_bubble. ... the performance hunt goes on
[/edit]

14. > gettimeofday(&start,NULL);
I'd check the return result, to make sure the function was successful

15. [latest edit]
A test of adv_bubble, duff_bubble and salem_bubble at the university on a Sun UltraSparc 5 was quite amazing:

gcc 2.95.3 no optimizations:

duff_bubble 3381 ms
salem_bubble 5633 ms

On the Intel Plattform (P3) adv_ and duff_bubble perform about the same and salem_bubble is the fastest function.

With -O2 turned on (UltraSparc) adv_ and duff_bubble perform about the same again (~1750 ms) and salem_bubble is back on the top with ~1350ms.

I wonder if it is the architecture or the compiler that's making the difference ?
[/latest edit]

> gettimeofday(&start,NULL);
I'd check the return result, to make sure the function was successful
The function was not successful. I don't know why. It was working before, but after implementing Duff's device the clock has gone mad. Maybe the Duff-sort now somehow changes data out of the bounds of the array. I will check this.

Still, your program is working wonderfully as you can see in my posting above.

Hmm ... I don't think, that the Duff-sort is related to the gettimeofday-problem, because the first time check is done before duff_bubble is called.
But what could be the error source?
[/edit]

The Duff-sort function was the problem. I forgot to adjust the array length and to change from post- to preincrement. The gettimeofday function still returns an error but the time display seems to be okay.
[/edit]

I've found a little bug in the outer loop. It must be
while(--counter)

and NOT while(counter--)

I've discovered this bug, after I've removed the variable "test" and "if(test) return 1;".
The Duff-sort is without this testing even faster than my adv bubble sort with testing removed, too.

• cygwin gcc 3.2 unoptimized
Qsort time=6165706
bubble_reference time=1333714207
bubble_sort_nick time=1137563691
duff_bubble time=986999012
bubble_salem time=801146807