Thread: A little optimization challenge.

  1. #1
    Registered User
    Join Date
    Sep 2020
    Posts
    122

    A little optimization challenge.

    This challenge is inspired by C-UL8R's thread at How can I improve/shorten my C code?

    The challenge is to count the number of vowels and consonants in the text file that contains the works of William Shakespeare. The text file can be grabbed from https://ocw.mit.edu/ans7870/6/6.006/...hakespeare.txt

    Here's the reference implementation:

    Code:
    /***********************************************
    * charcount.c : count consonants and vowels in a file
    ***********************************************/
    #include <stdio.h>
    #include <stdint.h>
    #include <ctype.h>
    #include <sys/time.h>
    
    
    static char fname[] = "t8.shakespeare.txt";
    
    
    static int isVowel(int c) {
      c = tolower(c);
      return (c == 'a') || (c == 'e') || (c == 'i') || (c == 'o') || (c == 'u');
    }
    
    
    static int isConsonant(int c) {
      if(isVowel(c)) return 0;
      return isalpha(c);
    }
    
    
    int main(int argc, char *argv[]) {
       struct timeval start, end;
       uint32_t vowelCount = 0;
       uint32_t conCount = 0;
    
       gettimeofday(&start, NULL);
    
       FILE *f = fopen(fname,"r");
       if(f == NULL) {
          printf("Unable to open %s\n",fname);
          printf("\n");
          printf("Please download the test file from \n");
          printf("https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt\n");
          return 0;
       }
    
       //=========== Start of challenge code =========
       int c = getc(f);
       while(c != EOF) {
          if(isVowel(c)) {
             vowelCount++;
          }
          if(isConsonant(c)) {
             conCount++;
          }
          c = getc(f);
       }
       //=========== End of challenge code =========
    
       fclose(f);
    
       gettimeofday(&end, NULL);
    
       printf("There are %u consonants and %u vowels in the Complete Works of William Shakespeare\n",conCount, vowelCount);
       printf("Time taken = %li microseconds\n", (long int)(end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec));
    }
    Of course you can add as many functions or other code as you like. Code size or memory usage isn't a concern, it just has to be portable C code.

    Scoring will be a honesty system, and just be a "this is X times better than the reference code on my machine". It is expected that you will compile both code with the same set of compiler options, with optimizations turned on.

    e.g. On my i3-3110M laptop the above code takes 98733 microseconds to run. If I write an improved version that takes 12345 microseconds to run it is 98733/12345 = 7.99x the speed of the original.

    For completeness sake, here is the answer I get - I hope it's right!

    Code:
    There are 2357034 consonants and 1433891 vowels in the Complete Works of William Shakespeare
    Time taken = 98733 microseconds
    I'm guessing a 5x is easily possible, a 10x speedup should be achievable, but I'll be very surprised if anybody gets a 20x speedup without resorting to using multiple threads or CPU specific features.

    I'll post my best version in about a week...

  2. #2
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,795
    With compiler optimisations turned on or off?

  3. #3
    Registered User
    Join Date
    Sep 2020
    Posts
    122
    Quote Originally Posted by Hodor View Post
    With compiler optimisations turned on or off?
    As per the OP:

    It is expected that you will compile both code with the same set of compiler options, with optimizations turned on.

  4. #4
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,795
    Ok, as a start I'll begin with this (no compiler optimisations)

    Code:
       //=========== Start of challenge code =========
       
       int c;
       while((c = getc(f)) != EOF) {
            if (isalpha(c)) {
               int lc = tolower(c);
               int iv = lc == 'e' || lc == 'a' || lc == 'o' || lc == 'i' || lc == 'u';
               vowelCount += iv;
               conCount += !iv;
            }
       }
       //=========== End of challenge code =========
    The original code:

    Code:
    $./a.out 
    There are 2357034 consonants and 1433891 vowels in the Complete Works of William Shakespeare
    Time taken = 102125 microseconds
    My modified code:

    Code:
    $ ./a.out 
    There are 2357034 consonants and 1433891 vowels in the Complete Works of William Shakespeare
    Time taken = 32977 microseconds
    Only optimisation I made, really, is ordering the checks for a vowel based on English vowel frequency; i.e. I'm not concerned about much, right now, apart from how often vowels appear in typical English text. So, no real code changes but an improvement.

    Edit: So I guess I didn't optimise it at all. Or did I?

    Edit2: meh... compiler optimisations were turned off. My bad. My non-code-related changes are still faster with -O2 though.
    Last edited by Hodor; 10-02-2020 at 04:38 AM.

  5. #5
    Registered User
    Join Date
    Feb 2019
    Posts
    763
    My best shot, so far.

    The 'original' is the code above. 'count' is mine. Pure C, portable (I think)...

    I'll post when you post yours!

    Code:
    $ ./original; echo; ./count
    Cycles: 223980894.
    There are 2357034 consonants and 1433891 vowels in the Complete Works of William Shakespeare
    Time taken = 66048 μs
    
    Cycles: 22487090.
    There are 2357034 consonants and 1433891 vowels in the Complete Works of William Shakespeare.
    Time: 6671 μs.
    Last edited by flp1969; 10-02-2020 at 06:45 AM.

  6. #6
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,076
    Quote Originally Posted by flp1969
    I'll post when you post yours!
    Clearly, you need a code escrow service
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  7. #7
    Registered User
    Join Date
    Feb 2019
    Posts
    763
    PS: Improved the code a little more... now it is 10x faster.

  8. #8
    Registered User
    Join Date
    Feb 2019
    Posts
    763
    Ok... here it is my version:
    Code:
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/time.h>
    
    #ifdef DEBUG
    # include <inttypes.h>
    # include "cycle_counting.h"
    #endif
    
    #define FILENAME      "t8.shakespeare.txt"
    
    // Ideal buffer size?!
    // I got the best results for disk I/O with this size.
    #define BUFFER_SIZE   16384
    
    // is_vowel() and is_consonant() assumes ASCII.
    // Technically this is not "portable", since we could've been dealing
    // with EBCDIC, for example!
    // But I think is "portable" enough.
    //
    // Also, don't challenge the default word size because using
    // _Bool will slow things down.
    static int is_vowel( unsigned char c )
    {
      // 1 KiB array to avoid "special" characters used in
      // user's system (works with UTF-8 too).
      static const int vowels[] =
      {
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
      };
    
      return vowels[c];
    }
    
    static int is_consonant( unsigned char c )
    {
      // 1 KiB array to avoid "special" characters used in
      // user's system (works with UTF-8 too).
      static const int consonants[] =
      {
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
        1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
        1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
      };
    
      return consonants[c];
    }
    
    int main( void )
    {
      static char buffer[BUFFER_SIZE];
      char *p;
      int fd;
      ssize_t size;
      unsigned long int v, c;   // 'long int' is 64 bits long on x86-64.
                                // I could use 'unsigned int' to speed things up, but
                                // I don't know if the counts would overflow.
      struct timeval start, end;
    
      gettimeofday( &start, NULL );
    
    #ifdef DEBUG
      counter_T c1;
    
      c1 = BEGIN_TSC();
    #endif
    
      // NOTE: Using file descriptors and syscalls
      //       because FILE streams are too slow.
      if ( ( fd = open( FILENAME, O_RDONLY ) ) < 0 )
      {
        fputs( "ERROR opening file.\n", stderr );
        return EXIT_FAILURE;
      }
    
      v = c = 0;
      while ( ( size = read( fd, buffer, BUFFER_SIZE ) ) > 0 )
      {
        p = buffer;
    
        while ( size-- )
        {
          // NOTE: 'else' is necessary not to do the test twice!
          //       And since we have more consonants than vowels (I think),
          //       the order of these tests are important.
          if ( is_consonant( *p ) )
            c++;
          else
            if ( is_vowel( *p ) )
              v++;
    
          p++;
        }
      }
    
      close( fd );
    
    #ifdef DEBUG
      END_TSC( &c1 );
      printf( "Cycles: %" PRIu64 ".\n", c1 );
    #endif
    
      gettimeofday( &end, NULL );
    
      printf( "There are %lu consonants and %lu vowels in the Complete Works of William Shakespeare.\n", c, v );
      printf( "Time: %ld μs.\n", ( long int )( end.tv_sec - start.tv_sec ) * 1000000 + ( end.tv_usec - start.tv_usec ) );
    
      return EXIT_SUCCESS;
    }
    cycle_counting.h is this one:
    Code:
    #ifndef CYCLE_COUNTING_INCLUDED__
    #define CYCLE_COUNTING_INCLUDED__
    
    #ifndef __GNUC__
    # error Works only on GCC
    #endif
    
    /* ==========================================
        Quick & Dirty cycle counting...
    
        As funções usadas para contar a quantidade de ciclos
        de clock gastos num bloco de código.
    
        Exemplo de uso:
    
          counter_T cnt;
    
          cnt = BEGIN_TSC();      
          f();
          END_TSC(&cnt);
    
        Defina SYNC_MEM se quiser uma serialização mais completa (mfence).
       ========================================== */
    #include <stdint.h>
    
    // É necessário usar esse "tipo" para os contadores
    // porque o compilador vai tentar otimizar qualquer variável
    // não volátil.
    typedef volatile uint64_t counter_T;
    
    #if defined(__x86_64__)
      inline counter_T BEGIN_TSC( void )
      {
        uint32_t a, d;
    
        __asm__ __volatile__ (
      #ifdef SYNC_MEM
          "mfence\n\t"
      #endif
          "cpuid\n\t"
          "rdtsc" : "=a" (a), "=d" (d) : "a" (0) : "rbx", "rcx"
        );
    
        return a | ((uint64_t)d << 32);
      }
    
      inline void END_TSC( counter_T *cptr )
      {
        uint32_t a, d;
    
        __asm__ __volatile__ (
          "rdtscp" : "=a" (a), "=d" (d) :: "rcx"
        );
    
        *cptr = (a | ((uint64_t)d << 32)) - *cptr;;
      }
    #elif defined(__i386__)
      inline counter_T BEGIN_TSC( void )
      {
        uint32_t a, d;
    
        __asm__ __volatile__ (
      // Só usa mfence se o processador suportar SSE2.
      #ifdef __SSE2__
      #  ifdef SYNC_MEM
          "mfence\n\t"
      #  endif
      #endif
          "cpuid\n\t"
          "rdtsc" : "=a" (a), "=d" (d) : "a" (0) : "ebx", "ecx"
        );
    
        return a | ((uint64_t)d << 32);
      }
    
      inline void END_TSC( counter_T *cptr )
      {
        uint32_t a, d;
    
        __asm__ __volatile__ (
          "rdtscp" : "=a" (a), "=d" (d) :: "ecx"
        );
    
        *cptr = (a | ((uint64_t)d << 32)) - *cptr;
      }
    #elif __arm__
      #if __ARM_32BIT_STATE == 1
        // This works only on ARMv8
        #if __ARM_ARCH__ > 7
          inline counter_T BEGIN_TSC( void )
          {
            unsigned int r0, r1;
    
            __asm__ __volatile__ (
          #ifdef SYNC_MEM
              "dmb\n\t"
          #endif
              "mrrc p15,1,%0,%1,c14"        // FIXME: Must check this.
                                            //        Get the virtual counter.
              : "=r" (r0), "=r" (r1)
            );
    
            return ((uint64_t)r1 << 32) | r0;
          }
    
          inline void END_TSC( counter_T *cptr )
          {
            unsigned int r0, r1;
    
            __asm__ __volatile__ (
              "mrrc p15,1,%0,%1,c14"      // FIXME: Must check this.
                                          //        get the virtual counter.
              : "=r" (r0), "=r" (r1)
            );
    
            *cptr = (((uint64_t)r1 << 32) | r0) - *cptr;
          }
        #else
          #error ARMv8 or superior only.
        #endif
      #else   // otherwise we are in aarch64 mode.
        inline counter_T BEGIN_TSC( void )
        {
          uint64_t count;
    
          __asm__ __volatile__ ( 
        #ifdef SYNC_MEM
          "dmb\n\t"
        #endif
          "mrs %0,cntvct_el0" : "=r" (count) );
    
          return count;
        }
    
        inline void END_TSC( counter_T *cptr )
        {
          uint64_t count;
    
          __asm__ __volatile__ ( "mrs %0,cntvct_el0" : "=r" (count) );
    
          *cptr = count - *cptr;
        }
      #endif
    #else
    # error i386, x86-64 and ARM only.
    #endif
    
    #endif
    And here's the Makefile:
    Code:
    # You said otimizations are allowed!!!
    CFLAGS=-O2 -march=native -mtune=native -ftree-vectorize -fno-stack-protector
    LDFLAGS=-s
    
    .PHONY: all clean distclean
    
    all: original count
    
    original: original.o
        $(CC) $(LDFLAGS) -o $@ $^
    
    count: count.o
        $(CC) $(LDFLAGS) -o $@ $^
    
    count.o: count.c cycle_counting.h
    original.o: original.c cycle_counting.h
    
    clean:
        -rm *.o
    
    distclean: clean
        rm original count
    Of course, I added the cycle counting on your original code as well.. Just don't define DEBUG and the code above is pure C.
    Last edited by flp1969; 10-02-2020 at 07:05 AM.

  9. #9
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,795
    Quote Originally Posted by flp1969 View Post
    Ok... here it is my version:
    Code:
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/time.h>
    
    #ifdef DEBUG
    # include <inttypes.h>
    # include "cycle_counting.h"
    #endif
    
    #define FILENAME      "t8.shakespeare.txt"
    
    // Ideal buffer size?!
    // I got the best results for disk I/O with this size.
    #define BUFFER_SIZE   16384
    
    // is_vowel() and is_consonant() assumes ASCII.
    // Technically this is not "portable", since we could've been dealing
    // with EBCDIC, for example!
    // But I think is "portable" enough.
    //
    // Also, don't challenge the default word size because using
    // _Bool will slow things down.
    static int is_vowel( unsigned char c )
    {
      // 1 KiB array to avoid "special" characters used in
      // user's system (works with UTF-8 too).
      static const int vowels[] =
      {
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
      };
    
      return vowels[c];
    }
    
    static int is_consonant( unsigned char c )
    {
      // 1 KiB array to avoid "special" characters used in
      // user's system (works with UTF-8 too).
      static const int consonants[] =
      {
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
        1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
        1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
      };
    
      return consonants[c];
    }
    
    int main( void )
    {
      static char buffer[BUFFER_SIZE];
      char *p;
      int fd;
      ssize_t size;
      unsigned long int v, c;   // 'long int' is 64 bits long on x86-64.
                                // I could use 'unsigned int' to speed things up, but
                                // I don't know if the counts would overflow.
      struct timeval start, end;
    
      gettimeofday( &start, NULL );
    
    #ifdef DEBUG
      counter_T c1;
    
      c1 = BEGIN_TSC();
    #endif
    
      // NOTE: Using file descriptors and syscalls
      //       because FILE streams are too slow.
      fd = open( FILENAME, O_RDONLY );
    
      v = c = 0;
      while ( ( size = read( fd, buffer, BUFFER_SIZE ) ) > 0 )
      {
        p = buffer;
    
        while ( size-- )
        {
          // NOTE: 'else' is necessary not to do the test twice!
          //       And since we have more consonants than vowels (I think),
          //       the order of these tests are important.
          if ( is_consonant( *p ) )
            c++;
          else
            if ( is_vowel( *p ) )
              v++;
    
          p++;
        }
      }
    
      close( fd );
    
    #ifdef DEBUG
      END_TSC( &c1 );
      printf( "Cycles: %" PRIu64 ".\n", c1 );
    #endif
    
      gettimeofday( &end, NULL );
    
      printf( "There are %lu consonants and %lu vowels in the Complete Works of William Shakespeare.\n", c, v );
      printf( "Time: %ld μs.\n", ( long int )( end.tv_sec - start.tv_sec ) * 1000000 + ( end.tv_usec - start.tv_usec ) );
    
      return 0;
    }
    cycle_counting.h is this one:
    Code:
    #ifndef CYCLE_COUNTING_INCLUDED__
    #define CYCLE_COUNTING_INCLUDED__
    
    #ifndef __GNUC__
    # error Works only on GCC
    #endif
    
    /* ==========================================
        Quick & Dirty cycle counting...
    
        As funções usadas para contar a quantidade de ciclos
        de clock gastos num bloco de código.
    
        Exemplo de uso:
    
          counter_T cnt;
    
          cnt = BEGIN_TSC();      
          f();
          END_TSC(&cnt);
    
        Defina SYNC_MEM se quiser uma serialização mais completa (mfence).
       ========================================== */
    #include <stdint.h>
    
    // É necessário usar esse "tipo" para os contadores
    // porque o compilador vai tentar otimizar qualquer variável
    // não volátil.
    typedef volatile uint64_t counter_T;
    
    #if defined(__x86_64__)
      inline counter_T BEGIN_TSC( void )
      {
        uint32_t a, d;
    
        __asm__ __volatile__ (
      #ifdef SYNC_MEM
          "mfence\n\t"
      #endif
          "cpuid\n\t"
          "rdtsc" : "=a" (a), "=d" (d) : "a" (0) : "rbx", "rcx"
        );
    
        return a | ((uint64_t)d << 32);
      }
    
      inline void END_TSC( counter_T *cptr )
      {
        uint32_t a, d;
    
        __asm__ __volatile__ (
          "rdtscp" : "=a" (a), "=d" (d) :: "rcx"
        );
    
        *cptr = (a | ((uint64_t)d << 32)) - *cptr;;
      }
    #elif defined(__i386__)
      inline counter_T BEGIN_TSC( void )
      {
        uint32_t a, d;
    
        __asm__ __volatile__ (
      // Só usa mfence se o processador suportar SSE2.
      #ifdef __SSE2__
      #  ifdef SYNC_MEM
          "mfence\n\t"
      #  endif
      #endif
          "cpuid\n\t"
          "rdtsc" : "=a" (a), "=d" (d) : "a" (0) : "ebx", "ecx"
        );
    
        return a | ((uint64_t)d << 32);
      }
    
      inline void END_TSC( counter_T *cptr )
      {
        uint32_t a, d;
    
        __asm__ __volatile__ (
          "rdtscp" : "=a" (a), "=d" (d) :: "ecx"
        );
    
        *cptr = (a | ((uint64_t)d << 32)) - *cptr;
      }
    #elif __arm__
      #if __ARM_32BIT_STATE == 1
        // This works only on ARMv8
        #if __ARM_ARCH__ > 7
          inline counter_T BEGIN_TSC( void )
          {
            unsigned int r0, r1;
    
            __asm__ __volatile__ (
          #ifdef SYNC_MEM
              "dmb\n\t"
          #endif
              "mrrc p15,1,%0,%1,c14"        // FIXME: Must check this.
                                            //        Get the virtual counter.
              : "=r" (r0), "=r" (r1)
            );
    
            return ((uint64_t)r1 << 32) | r0;
          }
    
          inline void END_TSC( counter_T *cptr )
          {
            unsigned int r0, r1;
    
            __asm__ __volatile__ (
              "mrrc p15,1,%0,%1,c14"      // FIXME: Must check this.
                                          //        get the virtual counter.
              : "=r" (r0), "=r" (r1)
            );
    
            *cptr = (((uint64_t)r1 << 32) | r0) - *cptr;
          }
        #else
          #error ARMv8 or superior only.
        #endif
      #else   // otherwise we are in aarch64 mode.
        inline counter_T BEGIN_TSC( void )
        {
          uint64_t count;
    
          __asm__ __volatile__ ( 
        #ifdef SYNC_MEM
          "dmb\n\t"
        #endif
          "mrs %0,cntvct_el0" : "=r" (count) );
    
          return count;
        }
    
        inline void END_TSC( counter_T *cptr )
        {
          uint64_t count;
    
          __asm__ __volatile__ ( "mrs %0,cntvct_el0" : "=r" (count) );
    
          *cptr = count - *cptr;
        }
      #endif
    #else
    # error i386, x86-64 and ARM only.
    #endif
    
    #endif
    And here's the Makefile:
    Code:
    # You said otimizations are allowed!!!
    CFLAGS=-O2 -march=native -mtune=native -ftree-vectorize -fno-stack-protector
    LDFLAGS=-s
    
    .PHONY: all clean distclean
    
    all: original count
    
    original: original.o
        $(CC) $(LDFLAGS) -o $@ $^
    
    count: count.o
        $(CC) $(LDFLAGS) -o $@ $^
    
    count.o: count.c
    original.o: original.c
    
    clean:
        -rm *.o
    
    distclean: clean
        rm original count
    Of course, I added the cycle counting on your original code as well.. Just don't define DEBUG and the code above is pure C.

    That's stupid, flp1969. You have branches (e.g. if ( is_vowel( *p ) )) in that code so it's just... stupid.

  10. #10
    Registered User
    Join Date
    Feb 2019
    Posts
    763
    Quote Originally Posted by Hodor View Post
    That's stupid, flp1969. You have branches (e.g. if ( is_vowel( *p ) )) in that code so it's just... stupid.
    Well... do better then...

  11. #11
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,795
    Quote Originally Posted by flp1969 View Post
    Well... do better then...

    I don't think I can

  12. #12
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,795
    The point I was trying to make was that optimisations do not necessarily come from "code"; they can come from other places (i.e. the order you compare aeiou can make a huge difference because of "short circuiting"). The order things are checked is, in my opinion, can be more important than the code itself. I.e. I've trivialised it but the algorithm (and in this instance the assumptions it makes) is often more improtant than the code.


    That said, your code is pretty impressive.

  13. #13
    Registered User
    Join Date
    Feb 2019
    Posts
    763
    Quote Originally Posted by Hodor View Post
    That said, your code is pretty impressive.
    Thanks, there is room to improvement yet.

    To avoid that "branch" I could've done something like this:
    Code:
    c += is_consonant( *p );
    v += is_vowel( *p );
    But we'll gain, if something, just about a clock cycle or two if the static functions are to be compiled inline. Not a very good optimization.

    I changed 3 things in the original code: Got rid of FILE and stdio file manipulation because on some standard libraries they use mutexes (even with a single thread) - this is the case of glibc.
    Used a empirically sized buffer to speed up disk I/O.
    And, of course, avoided comparisons of ASCII chars using those tables.

    These changes are not "code" related, I think.

    Another step that could be taken is comparing MORE then one char at a time (4, as in sizeof(int)), but I thought this could lead to a complex and hard to understand code.

    []s
    Fred

  14. #14
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,795
    Quote Originally Posted by flp1969 View Post
    Thanks, there is room to improvement yet.

    To avoid that "branch" I could've done something like this:
    Code:
    c += is_consonant( *p );
    v += is_vowel( *p );
    But we'll gain, if something, just about a clock cycle or two if the static functions are to be compiled inline. Not a very good optimization.

    I changed 3 things in the original code: Got rid of FILE and stdio file manipulation because on some standard libraries they use mutexes (even with a single thread) - this is the case of glibc.
    Used a empirically sized buffer to speed up disk I/O.
    And, of course, avoided comparisons of ASCII chars using those tables.

    These changes are not "code" related, I think.

    Another step that could be taken is comparing MORE then one char at a time (4, as in sizeof(int)), but I thought this could lead to a complex and hard to understand code.

    []s
    Fred

    I was being sarcastic when I said "stupid" (Ok, maybe "sarcastic" is not the right word, I'm not sure, but I was certainly having a go at another user who thinks avoiding branches is the way to go)

  15. #15
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,795
    Quote Originally Posted by flp1969 View Post
    Thanks, there is room to improvement yet.

    To avoid that "branch" I could've done something like this:
    Code:
    c += is_consonant( *p );
    v += is_vowel( *p );
    But we'll gain, if something, just about a clock cycle or two if the static functions are to be compiled inline. Not a very good optimization.

    I changed 3 things in the original code: Got rid of FILE and stdio file manipulation because on some standard libraries they use mutexes (even with a single thread) - this is the case of glibc.
    Used a empirically sized buffer to speed up disk I/O.
    And, of course, avoided comparisons of ASCII chars using those tables.

    These changes are not "code" related, I think.

    Another step that could be taken is comparing MORE then one char at a time (4, as in sizeof(int)), but I thought this could lead to a complex and hard to understand code.

    []s
    Fred
    As impressive as your code is (I can't beat it) do you no agree that the order of evaluation of vowels can make a huge difference? Your code makes the order irrelevant of course but, in general, I think that stuff like this makes a difference. In my opinion algorithms make the biggest difference rather than code. Let's face it, this whole thread is (probably) more about the stuff you did and the stuff I did (changing the order of comparisons) than about the stupid sht that that other person[1] is suggesting.

    [1] awsdert suggesting that you must always avoid branches (whatever he means by that). Even back in my demo days writing tight code for Amiga demos/intros because the code we wrote really did have to be tight and fast we didn't think about things like he suggested; I mean... goto is a branch ffs... is a language Turing complete without branches?

    Edit: And, yes, awsdet... I wrote those things in 68k ASM and then later when I moved on x86 asm.
    Last edited by Hodor; 10-02-2020 at 09:27 AM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Need optimization
    By sc7 in forum C++ Programming
    Replies: 22
    Last Post: 11-10-2018, 11:11 AM
  2. Optimization challenge
    By Zacariaz in forum C Programming
    Replies: 1
    Last Post: 06-02-2015, 11:00 AM
  3. Optimization
    By lach in forum C Programming
    Replies: 4
    Last Post: 03-18-2006, 12:08 PM
  4. optimization
    By krithi in forum C Programming
    Replies: 9
    Last Post: 01-19-2003, 10:52 AM
  5. IDE optimization
    By Traveller in forum A Brief History of Cprogramming.com
    Replies: 1
    Last Post: 07-04-2002, 02:01 AM

Tags for this Thread