Sometimes "goto" is useful

**rcgldr** · 06-09-2019

Here is an example of a 4 way (no minheap) bottom up merge sort, which is about 15% faster than a 2 way merge sort and about as fast (sometimes a bit faster) than quick sort, on a processor with 16 registers, such as X86 in 64 bit mode, where the compiler optimization uses 8 of the registers for pointers to the current and end of runs. Trying to remove the goto's in this case would duplicate code and make it more complex. To implement this in Java, which doesn't have goto (but does reserve it), I had to switch to a top down merge sort, which is a bit slower, somewhat defeating the purpose of a 4 way merge sort (for performance).

One general reason for using goto is when there are multiple paths leading to common code, and this is an example of that.

Code:

int * BottomUpMergeSort(int a[], int b[], size_t n)
{
int *p0r;       // ptr to      run 0
int *p0e;       // ptr to end  run 0
int *p1r;       // ptr to      run 1
int *p1e;       // ptr to end  run 1
int *p2r;       // ptr to      run 2
int *p2e;       // ptr to end  run 2
int *p3r;       // ptr to      run 3
int *p3e;       // ptr to end  run 3
int *pax;       // ptr to set of runs in a
int *pbx;       // ptr for merged output to b
size_t rsz = 1; // run size
    if(n < 2)
        return a;
    if(n == 2){
        if(a[0] > a[1])std::swap(a[0],a[1]);
        return a;
    }
    if(n == 3){
        if(a[0] > a[2])std::swap(a[0],a[2]);
        if(a[0] > a[1])std::swap(a[0],a[1]);
        if(a[1] > a[2])std::swap(a[1],a[2]);
        return a;
    }
    while(rsz < n){
        pbx = &b[0];
        pax = &a[0];
        while(pax < &a[n]){
            p0e = rsz + (p0r = pax);
            if(p0e >= &a[n]){
                p0e = &a[n];
                goto cpy10;}
            p1e = rsz + (p1r = p0e);
            if(p1e >= &a[n]){
                p1e = &a[n];
                goto mrg201;}
            p2e = rsz + (p2r = p1e);
            if(p2e >= &a[n]){
                p2e = &a[n];
                goto mrg3012;}
            p3e = rsz + (p3r = p2e);
            if(p3e >= &a[n])
                p3e = &a[n];
            // 4 way merge
            while(1){
                if(*p0r <= *p1r){
                    if(*p2r <= *p3r){
                        if(*p0r <= *p2r){
mrg40:                      *pbx++ = *p0r++;    // run 0 smallest
                            if(p0r < p0e)       // if not end run continue
                                continue;
                            goto mrg3123;       // merge 1,2,3
                        } else {
mrg42:                      *pbx++ = *p2r++;    // run 2 smallest
                            if(p2r < p2e)       // if not end run continue
                                continue;
                            goto mrg3013;       // merge 0,1,3
                        }
                    } else {
                        if(*p0r <= *p3r){
                            goto mrg40;         // run 0 smallext
                        } else {
mrg43:                      *pbx++ = *p3r++;    // run 3 smallest
                            if(p3r < p3e)       // if not end run continue
                                continue;
                            goto mrg3012;       // merge 0,1,2
                        }
                    }
                } else {
                    if(*p2r <= *p3r){
                        if(*p1r <= *p2r){
mrg41:                      *pbx++ = *p1r++;    // run 1 smallest
                            if(p1r < p1e)       // if not end run continue
                                continue;
                            goto mrg3023;       // merge 0,2,3
                        } else {
                            goto mrg42;         // run 2 smallest
                        }
                    } else {
                        if(*p1r <= *p3r){
                            goto mrg41;         // run 1 smallest
                        } else {
                            goto mrg43;         // run 3 smallest
                        }
                    }
                }
            }
            // 3 way merge
mrg3123:    p0r = p1r;
            p0e = p1e;
mrg3023:    p1r = p2r;
            p1e = p2e;
mrg3013:    p2r = p3r;
            p2e = p3e;
mrg3012:    while(1){
                if(*p0r <= *p1r){
                    if(*p0r <= *p2r){
                        *pbx++ = *p0r++;        // run 0 smallest
                        if(p0r < p0e)           // if not end run continue
                            continue;
                        goto mrg212;            // merge 1,2
                    } else {
mrg32:                  *pbx++ = *p2r++;        // run 2 smallest
                        if(p2r < p2e)           // if not end run continue
                            continue;
                        goto mrg201;            // merge 0,1
                    }
                } else {
                    if(*p1r <= *p2r){
                        *pbx++ = *p1r++;        // run 1 smallest
                        if(p1r < p1e)           // if not end run continue
                            continue;
                        goto mrg202;            // merge 0,2
                    } else {
                        goto mrg32;             // run 2 smallest
                    }
                }
            }
            // 2 way merge
mrg212:     p0r = p1r;
            p0e = p1e;
mrg202:     p1r = p2r;
            p1e = p2e;
mrg201:     while(1){
                if(*p0r <= *p1r){
                    *pbx++ = *p0r++;            // run 0 smallest
                    if(p0r < p0e)               // if not end run continue
                        continue;
                    goto cpy11;
                } else {
                    *pbx++ = *p1r++;            // run 1 smallest
                    if(p1r < p1e)               // if not end run continue
                        continue;
                    goto cpy10;
                }
            }
            // 1 way copy
cpy11:      p0r = p1r;
            p0e = p1e;
cpy10:      while (1) {
                *pbx++ = *p0r++;                // copy element
                if (p0r < p0e)                  // if not end of run continue
                    continue;
                break;
            }
            pax += rsz << 2;            // setup for next set of runs
        }
        std::swap(a, b);                // swap ptrs
        rsz <<= 2;                      // quadruple run size
    }
    return a;                           // return sorted array
}

**GReaper** · 06-10-2019

Of course, an experienced programmer knows when it's wise to ignore blanket statements like "never use goto".

The reason goto is shunned is probably because many assembly programmers back in the day turned to C but kept using goto instead of the language's higher-level statements.

**flp1969** · 06-10-2019

Originally Posted by GReaper

The reason goto is shunned is probably because many assembly programmers back in the day turned to C but kept using goto instead of the language's higher-level statements.

Historical fact: Dijkstra, in 1968, published an article in ACM titled Go To Statement considered harmful, a couple of years before C.

**rcgldr** · 06-11-2019

Originally Posted by flp1969

Historical fact: Dijkstra, in 1968, published an article in ACM titled Go To Statement considered harmful, a couple of years before C.

... and in 1974, Donald Knuth wrote "Structured Programming with go to Statements" (you can do a web search to find copies of the original paper). The point of creating this thread is that I rarely see "multiple paths to common code" mentioned as a reason for using goto, other than for exiting out of a function to clean up code if some type of exception occurs in nested code. The 4 way merge sort is a more generic case of multiple paths to common code.

In the 4 way merge sort example I've posted, the code uses nested if's to determine which run has the smallest leading element as opposed to using a minheap. There are two possible paths leading to each of the 4 cases. In addition, when nearing the end of an array, or when the end of a run is reached, there's a drop down to a 3 way merge, then 2 way merge, then a copy of the rest of the remaining run, all of which also involve two possible paths to reach.

**Niccolo** · 06-11-2019

I thought I'd have a go at this.

I know the "goto" subject can start religious level wars. I happen to be on the side of flp1969 (and Dijkstra) here.

I present an example re-reading of the OP algorithm. My point isn't to argue so much as to compare, especially for those interested in the point.

I decided not to take a 4 way algorithm from the web, though there are many examples, as that wouldn't really demonstrate what I wanted to point out.

Instead, I read @rcgldr's algorithm as posted, deciphering what it says, not what the algorithm itself does. I implemented the same algorithm, but without goto. I made no alterations to the algorithm itself, only the implementation and some details about it.

Where I use inline functions, I named them resembling the labels from which they were copied. Most of the code I present is copied, though a few alterations in detail. For example:

Code:

   while(1){
                if(*p0r <= *p1r){
                    *pbx++ = *p0r++;            // run 0 smallest
                    if(p0r < p0e)               // if not end run continue
                        continue;
                    goto cpy11;

I re-state as this in a function:

Code:

 while(1)
 {
    if(*p[0].r <= *p[1].r)
    {
        *pbx++ = *p[0].r++;            // run 0 smallest
        if(p[0].r >= p[0].e)           // if not end run continue
        {
         MergeCpy11( p );
         break;  
        }

The test is inverted so as to remove the continue with no change in the result.

The original algorithm is included with the name "BottomUpMergeSortOrg", whereas the alternative I submit for comparison is "BottomUpMergeSortAlt".

Code:

#include <cstdlib>
#include <iostream>
#include <ctime>


#define ORG

int * BottomUpMergeSortOrg(int a[], int b[], size_t n);
int * BottomUpMergeSortAlt(int a[], int b[], size_t n);




struct msptr
{
 int *r;
 int *e;
};




void Merge3012( msptr *p, int *&pax, int *&pbx, size_t & rsz );
void Merge201( msptr *p, int *& pax, int *&pbx, size_t &rsz );
void MergeCpy10( msptr *p, int *&pax, int *&pbx, size_t &rsz );


inline void Merge3013( msptr *p )
{
 p[2] = p[3];
}

inline void Merge3023( msptr *p )
{
 p[1] = p[2];
 p[2] = p[3];
}

inline void Merge3123( msptr *p )
{
 p[0] = p[1];
 p[1] = p[2];
 p[2] = p[3];
}

inline void Merge202(  msptr *p  )
{
 p[1] = p[2];
}

inline void MergeCpy11( msptr *p )
{
 p[0] = p[1];
} 

inline void Merge212( msptr *p )
{
 p[0] = p[1];
 p[1] = p[2];
}

inline bool Merge32( msptr *p, int *&pbx )
{
 *pbx++ = *p[2].r++;                 // run 2 smallest
 if(p[2].r < p[2].e) return true;    // if not end run continue
 return false;
}

inline bool Merge40( msptr *p, int *&pbx)
{
 *pbx++ = *p[0].r++;
 if ( p[0].r < p[0].e ) return true;
 Merge3123( p );
 return false;
}

inline bool Merge41( msptr *p, int *&pbx )
{
 *pbx++ = *p[1].r++;
 if ( p[1].r < p[1].e ) return true;
 Merge3023( p );
 return false;
}


inline bool Merge42( msptr *p, int *&pbx )
{
 *pbx++ = *p[2].r++; // run 2 smallest
 if ( p[2].r < p[2].e ) return true;

 Merge3013( p );
 return false;
}

inline bool Merge43( msptr *p, int *&pbx )
{
 *pbx++ = *p[3].r++;
 if ( p[3].r < p[3].e ) return true;
 return false;
}





void Merge201( msptr *p, int *& pax, int *&pbx, size_t &rsz )
{
 while(1)
 {
    if(*p[0].r <= *p[1].r)
    {
        *pbx++ = *p[0].r++;            // run 0 smallest
        if(p[0].r >= p[0].e)           // if not end run continue
        {
         MergeCpy11( p );
         break;  
        }
    } 
    else 
    {
        *pbx++ = *p[1].r++;            // run 1 smallest
        if(p[1].r >= p[1].e)           // if not end run continue
        {
         break; 
        }
    }
 }

 MergeCpy10( p, pax, pbx, rsz );
}

void MergeCpy10( msptr *p, int *& pax, int *&pbx, size_t &rsz )               
{
 *pbx++ = *p[0].r++;         // copy element
 
 while (p[0].r < p[0].e) 
 {
  *pbx++ = *p[0].r++;        // copy element
 }

 pax += rsz << 2;            // setup for next set of runs
}




int * BottomUpMergeSortAlt(int a[], int b[], size_t n)
{
 std::cout << "New...." << std::endl;

 msptr p[4];
 int *pax;       // ptr to set of runs in a
 int *pbx;       // ptr for merged output to b
 size_t rsz = 1;

 if(n < 2)
     return a;

 if(n == 2)
 {
     if(a[0] > a[1])std::swap(a[0],a[1]);
     return a;
 }
  
 if(n == 3)
 {
     if(a[0] > a[2])std::swap(a[0],a[2]);
     if(a[0] > a[1])std::swap(a[0],a[1]);
     if(a[1] > a[2])std::swap(a[1],a[2]);
     return a;
 }
 
 while(rsz < n)
 {
     pbx = &b[0];
     pax = &a[0];                     
 
     while(pax < &a[n])
     {
         p[0].e = rsz + (p[0].r = pax);
         if(p[0].e >= &a[n])
         {
             p[0].e = &a[n];
             MergeCpy10( p, pax, pbx, rsz );
             Merge201( p, pax, pbx, rsz );
             continue;
         }
         p[1].e = rsz + (p[1].r = p[0].e);
         if(p[1].e >= &a[n])
         {
             p[1].e = &a[n];
             Merge201( p, pax, pbx, rsz );
             continue; 
         }
         p[2].e = rsz + (p[2].r = p[1].e);
         if(p[2].e >= &a[n])
         {
             p[2].e = &a[n];
             Merge3012( p, pax, pbx, rsz );
             Merge201( p, pax, pbx, rsz );
             continue;
         }
         p[3].e = rsz + (p[3].r = p[2].e);
         if(p[3].e >= &a[n])
         {
             p[3].e = &a[n];
         }
 
         // 4 way merge
         while(1)
         {
             if(*p[0].r <= *p[1].r )
             {
                 if(*p[2].r <= *p[3].r)
                 {
                     if(*p[0].r <= *p[2].r)
                     {
                      if ( !Merge40( p, pbx ) ) break;
                     } 
                     else 
                     {
                      if ( !Merge42( p, pbx ) ) break;
                     }
                 } 
                 else 
                 {
                     if(*p[0].r <= *p[3].r)
                     {
                      if ( !Merge40( p, pbx ) ) break;
                     } 
                     else 
                     {
                      if ( !Merge43( p, pbx ) ) break;
                     }
                 }
             } 
             else 
             {
                 if(*p[2].r <= *p[3].r)
                 {
                     if(*p[1].r <= *p[2].r)
                     {
                      if ( !Merge41( p, pbx ) ) break;
                     } 
                     else 
                     {
                      if ( !Merge42( p, pbx ) ) break;
                     }
                 }   
                 else 
                 {
                     if(*p[1].r <= *p[3].r)
                     {
                       if ( !Merge41( p, pbx ) ) break;
                     } 
                     else 
                     {
                       if ( !Merge43( p, pbx ) ) break;
                     }
                 }
             }
         }
         // 3 way merge
 
         Merge3012( p, pax, pbx, rsz );       
         Merge201( p, pax, pbx, rsz );
     }

     std::swap(a, b);                // swap ptrs
     rsz <<= 2;                      // quadruple run size
 }
 
 return a;                           // return sorted array

}


void Merge3012( msptr *p, int *&pax, int *&pbx, size_t & rsz )
{
 while(1)
 {
     if(*p[0].r <= *p[1].r)
     {
         if(*p[0].r <= *p[2].r)
         {
             *pbx++ = *p[0].r++;        // run 0 smallest
             if(p[0].r >= p[0].e)           // if not end run continue
             { 
              Merge212( p );
              return;
             }
         } 
         else 
         {
          if ( !Merge32( p, pbx ) ) return;
         }
     } 
     else 
     {
         if(*p[1].r <= *p[2].r)
         {
             *pbx++ = *p[1].r++;        // run 1 smallest
             if(p[1].r >= p[1].e)           // if not end run continue
             {
              Merge202( p );
              return;
             }
         } 
         else 
         {
          if ( !Merge32( p, pbx ) ) return;
         }
     }
 }
}




// original


int * BottomUpMergeSortOrg(int a[], int b[], size_t n)
{
int *p0r;       // ptr to      run 0
int *p0e;       // ptr to end  run 0
int *p1r;       // ptr to      run 1
int *p1e;       // ptr to end  run 1
int *p2r;       // ptr to      run 2
int *p2e;       // ptr to end  run 2
int *p3r;       // ptr to      run 3
int *p3e;       // ptr to end  run 3
int *pax;       // ptr to set of runs in a
int *pbx;       // ptr for merged output to b
size_t rsz = 1; // run size

std::cout << "Org...." << std::endl;

    if(n < 2)
        return a;
    if(n == 2){
        if(a[0] > a[1])std::swap(a[0],a[1]);
        return a;
    }
    if(n == 3){
        if(a[0] > a[2])std::swap(a[0],a[2]);
        if(a[0] > a[1])std::swap(a[0],a[1]);
        if(a[1] > a[2])std::swap(a[1],a[2]);
        return a;
    }
    while(rsz < n){
        pbx = &b[0];
        pax = &a[0];
        while(pax < &a[n]){
            p0e = rsz + (p0r = pax);
            if(p0e >= &a[n]){
                p0e = &a[n];
                goto cpy10;}
            p1e = rsz + (p1r = p0e);
            if(p1e >= &a[n]){
                p1e = &a[n];
                goto mrg201;}
            p2e = rsz + (p2r = p1e);
            if(p2e >= &a[n]){
                p2e = &a[n];
                goto mrg3012;}
            p3e = rsz + (p3r = p2e);
            if(p3e >= &a[n])
                p3e = &a[n];
            // 4 way merge
            while(1){
                if(*p0r <= *p1r){
                    if(*p2r <= *p3r){
                        if(*p0r <= *p2r){
mrg40:                      *pbx++ = *p0r++;    // run 0 smallest
                            if(p0r < p0e)       // if not end run continue
                                continue;
                            goto mrg3123;       // merge 1,2,3
                        } else {
mrg42:                      *pbx++ = *p2r++;    // run 2 smallest
                            if(p2r < p2e)       // if not end run continue
                                continue;
                            goto mrg3013;       // merge 0,1,3
                        }
                    } else {
                        if(*p0r <= *p3r){
                            goto mrg40;         // run 0 smallext
                        } else {
mrg43:                      *pbx++ = *p3r++;    // run 3 smallest
                            if(p3r < p3e)       // if not end run continue
                                continue;
                            goto mrg3012;       // merge 0,1,2
                        }
                    }
                } else {
                    if(*p2r <= *p3r){
                        if(*p1r <= *p2r){
mrg41:                      *pbx++ = *p1r++;    // run 1 smallest
                            if(p1r < p1e)       // if not end run continue
                                continue;
                            goto mrg3023;       // merge 0,2,3
                        } else {
                            goto mrg42;         // run 2 smallest
                        }
                    } else {
                        if(*p1r <= *p3r){
                            goto mrg41;         // run 1 smallest
                        } else {
                            goto mrg43;         // run 3 smallest
                        }
                    }
                }
            }
            // 3 way merge
mrg3123:    p0r = p1r;
            p0e = p1e;
mrg3023:    p1r = p2r;
            p1e = p2e;
mrg3013:    p2r = p3r;
            p2e = p3e;
mrg3012:    while(1){
                if(*p0r <= *p1r){
                    if(*p0r <= *p2r){
                        *pbx++ = *p0r++;        // run 0 smallest
                        if(p0r < p0e)           // if not end run continue
                            continue;
                        goto mrg212;            // merge 1,2
                    } else {
mrg32:                  *pbx++ = *p2r++;        // run 2 smallest
                        if(p2r < p2e)           // if not end run continue
                            continue;
                        goto mrg201;            // merge 0,1
                    }
                } else {
                    if(*p1r <= *p2r){
                        *pbx++ = *p1r++;        // run 1 smallest
                        if(p1r < p1e)           // if not end run continue
                            continue;
                        goto mrg202;            // merge 0,2
                    } else {
                        goto mrg32;             // run 2 smallest
                    }
                }
            }
            // 2 way merge
mrg212:     p0r = p1r;
            p0e = p1e;
mrg202:     p1r = p2r;
            p1e = p2e;
mrg201:     while(1){
                if(*p0r <= *p1r){
                    *pbx++ = *p0r++;            // run 0 smallest
                    if(p0r < p0e)               // if not end run continue
                        continue;
                    goto cpy11;
                } else {
                    *pbx++ = *p1r++;            // run 1 smallest
                    if(p1r < p1e)               // if not end run continue
                        continue;
                    goto cpy10;
                }
            }
            // 1 way copy
cpy11:      p0r = p1r;
            p0e = p1e;
cpy10:      while (1) {
                *pbx++ = *p0r++;                // copy element
                if (p0r < p0e)                  // if not end of run continue
                    continue;
                break;
            }
            pax += rsz << 2;            // setup for next set of runs
        }
        std::swap(a, b);                // swap ptrs
        rsz <<= 2;                      // quadruple run size
    }
    return a;                           // return sorted array
}







int main()
{                           
 int *source;
 int *dest;

 srand( (unsigned int) time( NULL ) );

 const int count = 15000000;

 double avg{ 0.0 };

 for( int n = 0; n < 70; ++n )
 {
 source = new int[ count ];
 dest   = new int[ count ];

 for( int n = 0; n < count; ++n )
   {
    source[ n ] = rand();
    dest[ n ] = 0;
   }

 time_t s = clock();

#ifdef ORG

int * result = BottomUpMergeSortOrg( source, dest, count );

#else

int * result = BottomUpMergeSortAlt( source, dest, count );

#endif

 time_t e = clock();

 avg += ( e - s );

 std::cout << "Time: " << ( e - s ) << " Avg: " << avg / ( double( n ) + 1.0 ) << std::endl;

 int val = dest[ 0 ];

 for( int n = 1; n < count; ++n )
   {
    if ( val > source[ n ] ) { std::cout << "Not sorted" << std::endl; break; }
    val = source[ n ];
   }

 delete[] source;
 delete[] dest;
 }

 return 0;
}

It is, as the original, a flat file (no header). It is not an attempt at high quality C++, just a re-read/refactor eliminating the use of goto.

Timing is the same on an optimized build under Visual Studio 2019 (64 bit console), but the debug build shows that the alternative version takes much more time. The optimizer makes good use of inlines and various cues, but it runs about the same in release with standard optimization settings.

I believe our elders (of which I'm almost a part) back in the 70's through the end of that era when 1 Gbyte of RAM was unaffordable and optimizers were weak, would have seen a difference like that of the debug build. Our modern compilers make this issue evaporate, as demonstrated here.

I noticed, and can't explain (I didn't check), the sort fails on both functions when volume is at 20 million, so I tested 15 million.

A define at the top, "ORG", is set to use the original version, comment it out to try the alternative version.

On my 4790K @ 4.2 Ghz, the 32 bit release build at 15 million averaged about 960 ms. The 64 bit release average about 780 ms, for both versions.

**Niccolo** · 06-11-2019

@rcgldr,

Originally Posted by rcgldr

... and in 1974, Donald Knuth wrote "Structured Programming with go to Statements"

We wrote our two posts within minutes of each other, and I didn't notice this until later.

I wonder if you read Knuth's paper, though. Here's a relevant section:

Probably the worst mistake any one can make with respect to the subject of go to statements is to assume that “structured programming” is achieved by writing programs as we always have and then eliminating the go to‘s. Most go to‘s shouldn’t be there in the first place! What we really want is to conceive of our program in such a way that we rarely even think about go to statements, because the real need for them hardly ever arises. The language in which we express our ideas has a strong influence on our thought processes. Therefore, Dijkstra asks for more new language features — structures which encourage clear thinking — in order to avoid the go to‘s temptations toward complications.

Now, it is quite true that Knuth spends time in that paper to discuss the use of goto. His view is that of the assembler, and in '74 C was considered a CPU independent assembler. It was initially fashioned to re-write UNIX for portability. UNIX had been written in assembler first, but translated into C when the language was created. Naturally, when translating assembler, the destination language would be required to accept constructs like those found in the source material, and especially when the language was purpose built for that process.

Knuth was acknowledging, however, that improvements in language should suffice so as to allow us to think in terms without explicit jumps. C was new, UNIX was not yet released for the public, and Berkeley hadn't had the chance to take a crack at it.

Every close bracket is an implied jump point. What follows the bracket is an implied join to code surrounding the bracketed block.

What I attempted to show in posting a non-goto reinterpretation is that the resulting code is, arguably, a little more readable, with virtually no performance detriment. What code may appear to repeat is trivial, mostly copying parts of the pointer pair I chose to place in msptr.

I can't say I'm convinced on the case that goto is necessary for speed, or that it helps structure code. I've not used it in nearly 40 years, and didn't have a motivation to.

**rcgldr** · 06-12-2019

Originally Posted by Niccolo

I thought I'd have a go at this.

Thanks

Originally Posted by Niccolo

The sort fails on both functions when volume is at 20 million, so I tested 15 million.

The code in main needs to check the array pointed to by the returned pointer in "result" (instead of dest or source). To end up with the sorted array in the original array (not the working array, the code could calculate the number of passes for a given count, and if the number of passes would be odd, do an in place insertion sort 4 elements at at time at the start, so that run size (rsz) starts at 4, but this code was being used to compare pure 4-way versus pure 2-way (also to show a minheap based 4 way merge sort was much slower). A hybrid insertion / merge sort is faster, with the first pass creating runs of either 16 or 64 to keep the merge sort pass count even.

Originally Posted by Niccolo

On my 4790K @ 4.2 Ghz, the 32 bit release build at 15 million averaged about 960 ms. The 64 bit release average about 780 ms, for both versions.

The code is intended for a processor with 16 registers, so it should be a 64 bit release build, in order to get the 16 registers.

On my system, 3770K @ 3.5 Ghz, VS2015, Win 7 Pro 64 bit, the 64 bit release build, with the changes noted below (VS rand only returns 15 bit values), it took about 1.17 seconds with original (leaving the merge201 after the copy?), 1.18 seconds with alternate, less than 1% difference. As you posted, compiler optimization is doing a good job dealing with the inline functions. The 4-way merge code is old code, based on stuff from the 1980's, when compilers weren't that code (had to include "register" in the local variable declarations to get the compiler to use registers).

I'm wondering about this part of the code:

Code:

         while(pax < &a[n])
         {
             p[0].e = rsz + (p[0].r = pax);
             if(p[0].e >= &a[n])
             {
                 p[0].e = &a[n];
                 MergeCpy10( p, pax, pbx, rsz );
                 Merge201( p, pax, pbx, rsz );    // ???
                 continue;
             }

Seems like that call to Merge201 is unneeded. I took it out, and the code still works, but it's slightly slower, which could be an issue with where code is located, as the 3770K seems to be sensitive to where tight loops are located.

I changed main to use a third array for a check. I copied the unordered source into an array named chk, then used std::sort on chk, and then compared chk with result to make sure elements weren't lost. With Visual Studio, rand() only returns 15 bit values, so I called it 4 times, masking and shifting the results to create pseudo random 32 bit integers.

**rcgldr** · 06-25-2019

Originally Posted by Niccolo

Code:

         if(p[0].e >= &a[n])
         {
             p[0].e = &a[n];
             MergeCpy10( p, pax, pbx, rsz );
             Merge201( p, pax, pbx, rsz );     // ?? this doesn't appear to be needed
             continue;
         }

I never got a reply about the call to Merge201 after doing a MergeCopy10.

I'm also wondering if the original version is easier to follow than the alternate version. The original version consistently progresses downwards towards the end of the loop for the merge operations as you drop from 4-way to 3-way to 2-way to copy, while the alternate version requires jumping back to see the inlined functions. Also I had the timing wrong, on my system, original averages 1.59 seconds, alternate averages 1.24 seconds, about 7% slower.

**Niccolo** · 06-26-2019

I never thought about it after our last exchange. I'll look to see if I still have that code. I usually keep all that I write, but on these forums I just grab an example console project, erase what's there and post some thread's example code.

As best I can recall I was just calling the common functions based on the goto labels, and noticed that a few times they were duplicates, but I don't actually recall. It may have been an error left behind.

It may take a while, or a re-try.

**rcgldr** · 06-26-2019

Originally Posted by Niccolo

- from rcgldr - noting a call to Merge201 after doing a MergeCopy10.

This code is used to handle the case where a single sorted run is all that is remaining to the end of the array, so it just does a copy of that single run. Calling Merge201 after that looks like it moves at least 2 elements beyond the end of the array.

**rcgldr** · 06-27-2019

Originally Posted by Niccolo

I'll look to see if I still have that code.

Can't you use the code from your prior post? I made a different version of main, calls rand() multiple times since VS rand() only returns 15 bits (it also matches some other old code used for testing), and a third array used to check the result (to make sure no elements are dropped).

Code:

int main()
{                           
 int *source;
 int *dest;
 int *chk;                              // added this for verify
 int r;                                 // added for random data

// srand( (unsigned int) time( NULL ) );   // wanted same pattern

 const int count = 15000000;

 double avg{ 0.0 };

 for( int n = 0; n < 10; ++n )   // 10 seemed good enough
 {
 source = new int[ count ];
 dest   = new int[ count ];
 chk    = new int[ count ];

 for( int n = 0; n < count; ++n )
   {
    r  = (((uint64_t)((rand()>>4) & 0xff))<< 0);
    r += (((uint64_t)((rand()>>4) & 0xff))<< 8);
    r += (((uint64_t)((rand()>>4) & 0xff))<<16);
    r += (((uint64_t)((rand()>>4) & 0xff))<<24);
    chk[n] = source[n] = r;
    dest[ n ] = 0;
   }

 std::sort(chk+0, chk+count);

 time_t s = clock();

#ifdef ORG

int * result = BottomUpMergeSortOrg( source, dest, count );

#else

int * result = BottomUpMergeSortAlt( source, dest, count );

#endif

 time_t e = clock();

 avg += ( e - s );

 std::cout << "Time: " << ( e - s ) << " Avg: " << avg / ( double( n ) + 1.0 ) << std::endl;

 for( int n = 0; n < count; ++n )
    if ( chk[n] != result[ n ] ) { std::cout << "Not sorted" << std::endl; break; }

 delete[] chk;
 delete[] dest;
 delete[] source;
 }


 return 0;
}

**rcgldr** · 07-22-2019

Originally Posted by Niccolo

On my 4790K @ 4.2 Ghz, ... 64 bit release average about 780 ms, for both versions.

My prior testing was wrong, on my 3770K at 3.5ghz, 64 bit mode, original takes ~1.18, alternate ~1.26 seconds, about 6.6% slower, not much of a difference.

**Niccolo** · 08-03-2019

I nearly forgot that I said I'd return to this, and it took a rainstorm blocking me from doing other things to notice this particular oversight.

So, I thought I'd check, and indeed when I went through the first time, fashioning functions out of snippets, I did add an unnecessary call to that one clause.

Then, prompted by a recent update to Visual Studio 2019, I decided to try something. Without otherwise changing the code (just removing that call), I built with optimizations wide open on MSVC. That is, not only favoring speed, but allowing for "any suitable" inline, enabled intrinsic functions, eliminating frame pointers, etc.

Then, I did the same for Clang 8.0.

The result is curious. For MS's compiler, the original you posted, testing 15 million at 100 runs, average around 775, but the build I proposed (minus that errant call) turned in about 798. I ran these a few times, MS hit one run as long as 810, one run as fast as 780. The original turned in one run at 771, the rest around 776, 775.

When I switched to Clang, turning on the related optimization options, the original you posted turned in 791 as its fastest, one at 800. The version I proposed turned in 1 run at 770, and others at 772, 771 and 774 (these are averages of 100 tests each).

This means that Clang didn't build the original quite as fast as MSVC did, but built the proposed version measurably faster than MSVC.

GCC under linux met in the middle, but my GCC is in a Linux VM. Even though the machine is quiet I wouldn't think that data is all that pertinent.

What this shows is that the particulars of the optimizer is quite key, and it would support the "wisdom" published for many years about optimization, in this era of optimizing compilers, the basics of which is to trust optimizers and focus on readable code. Of course, Microsoft's optimizer doesn't quite "shine" in this example, it did "ok", but this comparison made it clear it could do better.

I expected that, too. Both Clang and GCC are open source, far more attended by academia and considerable public interest (where "public" is defined as all those companies other than Microsoft who contribute to open source development of these two compilers). Certainly Apple's interest in Clang/LLVM would benefit that compiler, and GCC is historic in it's roots (though much older fashioned in source code style).

The proprietary efforts of Microsoft prove to compare poorly, and that was the "promise" of opens source tools of this type. That does not quite translate into better IDE's or photo editing software, but for something like compilers and operating systems (GUI aside), open source can obviously do what is claimed.

One thing this had prompted me to do is switch most of my development work to the Clang compiler when working on VS2019.

**rcgldr** · 08-04-2019

Originally Posted by Niccolo

... MSVC ... Clang 8.0 ...

One issue I found with 3770K was an unusual sensitivity to code location in tight loops. I was comparing 64 bit assembly code for Fibonacci number using a tight loop versus an unfolded loop, calculating the sum of fib(0) to fib(93), 1048576 (2^20) times, for a benchmark. To reduce cache hits for the unfolded version, the input to fib() was multiples of 37 % 93, + 93 = {0,37,74,18,...,75,19,56,93} (94 values).

For the tight loop version, padding to even / odd multiple 16 byte boundaries created a range of run time from 1.465 to 2.000 seconds, a big difference where the only difference was padding nops between functions to put those functions at even / odd 16 byte boundaries. The unfolded version, using a computed jump into the unfolded code (similar to an optimized switch case), was barely affected by the padding, 1.042 seconds best case, 1.048 seconds worst case. If interested, I can create another thread with the example code.

**I C everything** · 08-05-2019

Originally Posted by rcgldr

For the tight loop version, padding to even / odd multiple 16 byte boundaries created a range of run time from 1.465 to 2.000 seconds, a big difference where the only difference was padding nops between functions to put those functions at even / odd 16 byte boundaries. The unfolded version, using a computed jump into the unfolded code (similar to an optimized switch case), was barely affected by the padding, 1.042 seconds best case, 1.048 seconds worst case. If interested, I can create another thread with the example code.

please post that,isnt it possible to put the function that keeps the innerloop first after an alignas?
wouldnt it matter also with 64byte boundary for innerloop?

Thread: Sometimes "goto" is useful

Thread Tools

Search Thread

Display

Sometimes "goto" is useful

Similar Threads

Problem in template that uses "decltype", "typename", and "std::remove_reference"

How can I use "goto" for group of multiple statements?

"goto" question

"itoa"-"_itoa" , "inp"-"_inp", Why some functions have "

"CWnd"-"HWnd","CBitmap"-"HBitmap"...., What is mean by "

Tags for this Thread