Thread: Loop Unrolling/Block Copy

  1. #1
    Registered User
    Join Date
    Dec 2011
    Posts
    795

    Loop Unrolling/Block Copy

    My function (to replace instances of "find" with "rep" and copy):
    Code:
    void new_str_rep(char *dst, const char *src, const char *find, const char *rep)
    {
        if (!src || !find)
            return;
    
        size_t i;
    
        while (*src) {
            for (i = 0; find[i] == src[i] && find[i]; ) {
                i++; 
            }
            if (!find[i]) {
                src += i;
                for (i = 0; rep[i]; i++) 
                    *dst++ = rep[i];
                continue;
            }
            *dst++ = *src++; 
        }
        *dst = '\0';
    }
    works, but I feel like it could be sped up with some sort of loop unrolling or comparing in larger blocks. Right now, it is very linear and will spend a significant amount of time in the first loop, as each iteration has to check two things per byte.

    A simple re-write in assembly actually speeds it up, but only because it uses registers However, compiler optimization makes the same changes with the same results, producing relatively similar code.

    Does anyone know of any tricks I could use to make the current code faster, or replace slower segments with ones that are more efficient?

  2. #2
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Any reason you're concerned about what's probably going to amount to a relatively small speed up? The best the algorithm can do is roughly O(strlen(src) * strlen(find)), so you're squeezing cycles out of this, not orders of magnitude. I'm sure you're aware of this, but I'm wondering if the speedup will justify the effort and whether cranking up the optimization settings on the compiler wont be your best option.

    The only advice I have is don't write your own strcmp/memcmp. It's not likely to be any faster than the standard library versions, which are typically highly optimized for the architecture. They will "unroll" the comparisons for you, checking a whole word at a time if possible (you're comparing enough bytes and it can work something out with alignment/word boundaries), instead of just one byte as your code does. The compiler might inline memcmp and memcpy if it deems it beneficial.
    Code:
    find_len = strlen(find);
    rep_len = strlen(rep);
    while (*src) {
        if (memcmp(src, find, find_len) == 0) {
            memcpy(dst, rep, rep_len)
            src += find_len;
            dst += find_len;
        }
        else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    I didn't test that, but that's the general idea.
    Last edited by anduril462; 03-22-2012 at 04:06 PM. Reason: added missing line

  3. #3
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Cost–benefit analysis
    Benefit:
    Code runs x% quicker.

    Costs:
    The code is more complicated, so it took a lot longer to write.
    The code is more complicated, so it has more bugs.
    The code is more complicated, so it takes longer to fix them.
    The optimisation "hacks" result in poorer code generation from the compiler.
    The optimisation "hacks" reduce portability.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    Registered User
    Join Date
    Dec 2011
    Posts
    795
    Hmm. The reason I tried to write my own functions instead of the library ones is because eventually I was striving to make a (faster) assembly port, but if they are truly faster, I suppose a call could solve that. The other reason being that it seems redundant to call two library functions that calculated the same variable. Initially, using strstr(), it was incredibly slow because during every call, the lengths of both of the strings were calculated.

    This brings me to the other question, which will be slower (doesn't matter library or not, the calculations are still being done):
    - Calculating the length at the beginning, and for every iteration copying "rep", checking if the counter is less than the pre-calculated length.
    - Calculating nothing at the beginning, and for every iteration copying "rep", checking if it is zero. It sounds like one less loop, but there is more pointer dereferencing, which is likely slower than a simple integer compare.

    Strcpy() is the optimal solution to the above, being architecture-optimized and not requiring a pre-calculated length, unfortunately, it null terminates, which would take more cycles to remove than it's worth. Any advice as to which option I should choose?

    Edit:
    Quote Originally Posted by Salem View Post
    The code is more complicated, so it took a lot longer to write.
    The code is more complicated, so it has more bugs.
    The code is more complicated, so it takes longer to fix them.
    The optimisation "hacks" result in poorer code generation from the compiler.
    The optimisation "hacks" reduce portability.

    • Sure, it takes longer, but in absence of a deadline, this doesn't really affect the final product.
    • The result of an optimization is not necessarily more complicated code. If it is, however, this is definitely true.
    • Again, only an issue if the optimization introduces complexity.
    • Good point, although the compiler cannot always produce better code than the human.
    • Sometimes, but definitely not the case here (I've never seen a modern computer that lacks the ability to do simple for and while loops).
    Last edited by memcpy; 03-22-2012 at 03:45 PM.

  5. #5
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Quote Originally Posted by memcpy View Post
    Hmm. The reason I tried to write my own functions instead of the library ones is because eventually I was striving to make a (faster) assembly port, but if they are truly faster, I suppose a call could solve that. The other reason being that it seems redundant to call two library functions that calculated the same variable. Initially, using strstr(), it was incredibly slow because during every call, the lengths of both of the strings were calculated.

    Then I suggest you take a look at the glibc source code, and maybe a disassembly of the library function itself, and see what they do. At least it's a good starting point for a well-optimized version of memcmp, etc. Also, check out this site:
    Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X.

    This brings me to the other question, which will be slower (doesn't matter library or not, the calculations are still being done):
    - Calculating the length at the beginning, and for every iteration copying "rep", checking if the counter is less than the pre-calculated length.
    - Calculating nothing at the beginning, and for every iteration copying "rep", checking if it is zero. It sounds like one less loop, but there is more pointer dereferencing, which is likely slower than a simple integer compare.

    Strcpy() is the optimal solution to the above, being architecture-optimized and not requiring a pre-calculated length, unfortunately, it null terminates, which would take more cycles to remove than it's worth. Any advice as to which option I should choose?
    Honestly, I think memcmp/memcpy wins, at least over str functions, when src is long and there are likely to be many instances of find (if there aren't, then optimization is not a big deal ). The cost of calculating find_len and rep_len up front is likely to be relatively small. Big-O analysis would throw that out. The nice thing about using memcpy with a known length is again, it can work word-by-word, very quickly, until it's all done. If you have to find that zero byte, as with a string function, you're losing out on the word-by-word bonus. Even if you can copy word-by-word, you have to check each word for the existence of a zero byte somewhere in that word and only copy a partial word if you found it in there. That's probably costing you a few more cycles than it's saving you.

  6. #6
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Your interface is inherently unsafe. (The `dst' variable can be overrun in several different ways.) This is particularly problematic because the only way to check that it would be safe is to actually do the finding and counting in advance.

    Your implementation skips a lot of checks I'd consider necessary or at least beneficial. (If the source string is shorter than the target you can safely devolve into a copy.)

    You can also do a little more "bookkeeping" to get faster behavior on average. (It depends on the environment, but simple checking for lengths instead of dereferencing a pointer could be faster.) Yes, I know anduril462 has already suggested this; doing less work in the inner most loop is generally a win so the advice can stand being repeated.

    Your linear search could probably be faster, depending on what the compiler can do and any branch prediction, if you reorder the instructions as assuming a linear copy and breaking when a potential match is found. (That allows you to call `memcpy' on the range instead of a byte by byte copy as in the `else' case from the code posted by anduril462.)

    You'd see the most benefit from a better string searching algorithm especially if you assume that matches are the exception and not the rule. (Also, by breaking your string searching function out of the replace function you buy yourself a clearer implementation by being able to use the return value in `memcpy' logic instead of the weird combined machine you've made.)

    Soma
    Last edited by phantomotap; 03-22-2012 at 04:33 PM.

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    memcpy() will only start copying words at a time if the source and destination are word-aligned to begin with(*). You have no control over the alignment of 'rep', and the alignment of 'dst' will only be word-aligned every 4th character.


    (*) unless you have a rare architecture which supports unaligned transfers with zero performance penalty.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. unrolling strategies
    By m37h0d in forum C++ Programming
    Replies: 14
    Last Post: 06-27-2011, 06:42 AM
  2. Copy float value to memory block
    By JMK in forum C++ Programming
    Replies: 3
    Last Post: 09-18-2010, 02:07 PM
  3. Need loop to copy chunks from byte array
    By jjamjatra in forum C# Programming
    Replies: 2
    Last Post: 03-05-2009, 05:42 AM
  4. Loop a block of code
    By Axiom in forum C Programming
    Replies: 2
    Last Post: 11-23-2008, 11:33 PM
  5. realloc() doesnt copy the block correctly!
    By TalosChen in forum C Programming
    Replies: 7
    Last Post: 05-19-2006, 05:33 AM