Thread: Sorting an array of strings

  1. #16
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    Quote Originally Posted by std10093 View Post
    Jumby I suggest you again, to take a look at my first post's link, in order to use another algorithm! Or are you "obliged" to use this one?

    oogabooda, I am very curious to see how.
    I don't know what you're talking about, and neither do you.
    And your first post, suggesting a "more efficient" sort, is completely wrong.
    You either did not read, or did not understand the OP's post.
    He said he's only adding one string at a time to the array, and wants to insert it into place.
    Please try to remember that you are a beginner.
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  2. #17
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    Quote Originally Posted by Adak View Post
    It's faster to use strcpy() for short strings, than it is to swap pointers. The "break even" point varies from system to system, but has been reported as about 100 to 200 chars in length.
    But how can that be, Adak? Are you saying that its faster to copy 100 bytes than 4 or 8? That seems implausible. Can you provide a link to your source?
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  3. #18
    Registered User camel-man's Avatar
    Join Date
    Jan 2011
    Location
    Under the moon
    Posts
    693
    @oogabooga, so essentially you are saying that he can rearrange the pointers to sort instead of changing the contents of what the pointers point to. I have not done a program in C in a little while now and I am trying to refresh my memory a bit by engaging in these discussions on the forum

  4. #19
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by oogabooga View Post
    But how can that be, Adak? Are you saying that its faster to copy 100 bytes than 4 or 8? That seems implausible. Can you provide a link to your source?
    The issue is that the compare process puts much of the strings into the cache in addition to having to cache the pointers. Once strings are in the read cache, and the write back cache hides some of the overhead. For my tests with 32 bit pointers, the break even point for bottom up merge sorting data versus sorting pointers then doing one pass using the pointers to place the data in sorted order was somewhere between 128 and 256 bytes.

  5. #20
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    Quote Originally Posted by camel-man View Post
    @oogabooga, so essentially you are saying that he can rearrange the pointers to sort instead of changing the contents of what the pointers point to. I have not done a program in C in a little while now and I am trying to refresh my memory a bit by engaging in these discussions on the forum
    Yes. As long as he's actually dealing with pointers, then swapping the pointers is the way to go. So he'll have to malloc the string storage into array elements of type char*.
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  6. #21
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    Quote Originally Posted by rcgldr View Post
    The issue is that the compare process puts much of the strings into the cache in addition to having to cache the pointers. Once strings are in the read cache, and the write back cache hides some of the overhead. For my tests with 32 bit pointers, the break even point for bottom up merge sorting data versus sorting pointers then doing one pass using the pointers to place the data in sorted order was somewhere between 128 and 256 bytes.
    Well that sheds some light on it. You're saying that since in either case the strings must be compared, the string data is cached anyway. But I still can't see why swapping pointers would be slower.
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  7. #22
    Ultraviolence Connoisseur
    Join Date
    Mar 2004
    Posts
    555
    I agree, makes no sense as you still have to write out the entire string to the new memory address, caching only helps with the read, even if it's precached in memory it still has to write out to the new memory address. So that would write out 128 bytes instead of a pointer which would only be a few.

  8. #23
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    Quote Originally Posted by oogabooga View Post
    I don't know what you're talking about, and neither do you.
    And your first post, suggesting a "more efficient" sort, is completely wrong.
    You either did not read, or did not understand the OP's post.
    He said he's only adding one string at a time to the array, and wants to insert it into place.
    Please try to remember that you are a beginner.
    I did not pretend to be the master or a guru or anything like that.. Maybe writing more precise titles in the threads will help in the future.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  9. #24
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    Quote Originally Posted by std10093 View Post
    I did not pretend to be the master or a guru or anything like that.. Maybe writing more precise titles in the threads will help in the future.
    That's true. I didn't even notice the title.
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  10. #25
    Ultraviolence Connoisseur
    Join Date
    Mar 2004
    Posts
    555
    Here is an example that does a naive test for PTR swapping vs strcpy swapping:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    
    #define DICTFILE "/usr/share/dict/words"
    #define MAXRUNS 1000
    
    struct words {
        char ** word;
        size_t word_count;
        size_t block_size;;
    };
    
    int addWord(struct words * wl, const char * word)
    {
        int r = 0;
    
        /* Realloc wl->word */
        if (wl->word_count >= wl->block_size-1) {
            char ** tmp;
            wl->block_size <<= 1; /* increase by x2 */
            tmp = realloc(wl->word, wl->block_size * sizeof *wl->word);
            if (!tmp)  {
                wl->block_size >>= 1; /* we didn't succeed, reset to what it was */
                return r;
            }
            wl->word = tmp; /* success */
        }
    
        wl->word[wl->word_count] = malloc(strlen(word)+1);
        if (wl->word[wl->word_count]) {
            strcpy(wl->word[wl->word_count], word);
            ++wl->word_count;
            r = 1;
        }
    
        return r;
    }
    
    #define WL_PTR_SWAP(a,b) do {\
        char * tmp___ = (a); \
        (a) = (b);           \
        (b) = tmp___;        \
    }while(0)
    
    #define WL_STRCPY_SWAP(a,b) do {\
        char tmp___[BUFSIZ+1];      \
        strcpy(tmp___,a);           \
        strcpy(a, b);               \
        strcpy(b,tmp___);           \
    }while(0)
    
    int main(void)
    {
        struct words wl = {
            .word_count = 0,
            .block_size = 256
        };
        char rb[BUFSIZ+1];
        FILE * fp;
        size_t i;
        int run;
        clock_t start, t1[2], t2[2];
    
        start = clock(); /* get an initial offset to calculate from, see manpage */
    
        wl.word = calloc(wl.block_size, sizeof *wl.word);
        if (!wl.word) {
            perror("calloc");
            return 1;
        }
    
        /* Load in some words from DICTFILE */
        fp = fopen(DICTFILE, "r");
        if (!fp) 
            return 1;
        while (fgets(rb, sizeof rb, fp)) {
            char * p;
            if (p = strchr(rb, '\n')) *p = 0;
            addWord(&wl, rb);
        }
        fclose(fp);
    
        /* ------ ROTATE STRING TEST ------- */
        /* Test PTR method */
        t1[0] = clock();
        for (run = 0; run < MAXRUNS; ++run) {
            for (i = 0; i < wl.word_count; ++i)  {
                WL_PTR_SWAP(wl.word[i],wl.word[0]);
            }
        }
        t1[1] = clock();
    
        /* Test STRCPY method */
        t2[0] = clock();
        for (run = 0; run < MAXRUNS; ++run) {
            for (i = 0; i < wl.word_count; ++i)  {
                WL_STRCPY_SWAP(wl.word[i],wl.word[wl.word_count-1]);
            }
        }
        t2[1] = clock();
        
        /* Normalize times, see manpage for clock() */
        t1[0] -= start;
        t1[1] -= start;
        t2[0] -= start;
        t2[1] -= start;
    
        printf("List size: %lu\n", wl.word_count);
        printf("Runs: %d PTR test: %.4lfs\nRuns: %d STRCPY test: %.4lfs\n",
                MAXRUNS, (double)(t1[1]-t1[0]) / CLOCKS_PER_SEC, 
                MAXRUNS, (double)(t2[1]-t2[0]) / CLOCKS_PER_SEC);
    
        return 0;
    }
    I used /usr/share/dict/words on my system, you would have to change it to some list of words (one string on each line) for your system..

    Note I don't use strcmp() although I fail to see how adding an operation would "speed" things up somehow..?
    Sample output:
    Code:
    $ ./a.out 
    List size: 234937
    Runs: 1000 PTR test: 3.3900s
    Runs: 1000 STRCPY test: 20.3400s
    (I ran this several times and the results were consistent each time)

  11. #26
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by oogabooga View Post
    But how can that be, Adak? Are you saying that its faster to copy 100 bytes than 4 or 8? That seems implausible. Can you provide a link to your source?
    In a word, yes. Locality of the word within cache, is much better than for pointers.

    I couldn't find the exact source of that assertion, but this affirms it as well. It's from a pdf file titled:
    Cache-Efficient String Sorting Using Copying

    by:

    Ranjan Sinha*
    School of Computer Science and Information Technology
    RMIT University, GPO Box 2476V, Melbourne 3001, Australia
    [email protected]

    David Ring
    Palo Alto, [email protected]
    Justin Zobel
    School of Computer Science and Information Technology
    RMIT University, GPO Box 2476V, Melbourne 3001, Australia
    [email protected]


    Several algorithms for fast sorting of strings in internal memory have been described in recent years,
    including improvements to the standard quicksort (Bentley and McIlroy, 1993), multikey quicksort (Bentley
    and Sedgewick, 1997), and variants of radix sort (Andersson and Nilsson, 1998). The fastest such
    algorithm is our burstsort (Sinha and Zobel, 2004a), in which strings are processed depth-first rather than
    breadth-first and a complete trie of common prefixes is used to distribute strings amongst buckets. In all of
    these methods, the strings themselves are left in place, and only pointers are moved into sort order. This
    pointerized approach to string sorting has become a standard solution to the length and length variability
    of strings, as expressed by Andersson and Nilsson (1998):

    "to implement a string sorting algorithm efficiently we should not move the strings themselves
    but only pointers to them. In this way each string movement is guaranteed to take constant
    time."

    In this argument, moving the variable-length strings in the course of sorting is considered inefficient. Sorting
    uses an array of pointers to the strings, and, at the completion of sorting, only the pointers need to be
    in sort order.

    However, evolutionary developments in computer architecture mean that the efficiency of sorting strings
    via pointers needs to be reexamined. Use of pointers is efficient if they are smaller than strings and if the
    costs of accessing and copying strings and pointers are uniform, but the speed of memory access has fallen
    behind processor speeds and most computers now have significant memory latency. Each string access
    can potentially result in a cache miss; that is, if a collection is too large to fit in cache memory, the cost
    of out-of-cache references may dominate other factors, making it worthwhile to move even long strings in
    order to increase their spatial locality. It is the cost of accessing strings via pointers that gives burstsort
    (which uses no fewer instructions than other methods) a speed advantage.
    This is badly worded, but what he means is that, the best burst sort version now, does NOT use pointers, but copies
    the strings themselves, while nearly all the other algorithms, still use pointers instead.

  12. #27
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    That last bit is definitely badly worded, but I'll take your word for it that it means what you say.


    So that's another piece of the puzzle. Actually moving the string data will increase the locality of further comparisons, whereas using pointers means you could be comparing strings from all over the place.


    And nonpuz's example would not be pertinent since it's just testing the swapping and is not actually a sort.
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  13. #28
    Ultraviolence Connoisseur
    Join Date
    Mar 2004
    Posts
    555
    I think the point is missed that while yes there is a increase in cache locality of the strings by copying them there is also still the clear penalty of the strcpy() itself, which my code demonstrates.

    My point is, will the benefit of the increased locality of comparisons outweight the penalty of writing out an entire string?

    Not to mention all this is still hinging on the breaking point of strings being less than 256 bytes, what if you have 5000 strings that are 255 bytes and 1000 that are 2000 bytes long?

    And also, not a very good sorting algorithm that enforces restrictions on the length of inputted data to maintain efficiency.

  14. #29
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by nonpuz View Post
    I think the point is missed that while yes there is a increase in cache locality of the strings by copying them there is also still the clear penalty of the strcpy() itself, which my code demonstrates.

    My point is, will the benefit of the increased locality of comparisons outweight the penalty of writing out an entire string?

    Not to mention all this is still hinging on the breaking point of strings being less than 256 bytes, what if you have 5000 strings that are 255 bytes and 1000 that are 2000 bytes long?

    And also, not a very good sorting algorithm that enforces restrictions on the length of inputted data to maintain efficiency.
    I'm not sure just what your program demonstrates.

    I'm sorting 500,000 random English words using strcpy(), in five bathches of 100,000 words each.

    Total time is 1.06 seconds, which includes subsequent merging of all five 100,000 word files, into one sorted file.

    English words (non scientific descriptions or made up "words" excluded), have a maximum length of 28 letters, by my definition of a word. (antidisestablishmentarianism)

    Very long strings definitely benefit by being sorted by only moving a pointer.

  15. #30
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    Quote Originally Posted by Adak View Post
    I'm not sure just what your program demonstrates.

    I'm sorting 500,000 random English words using strcpy(), in five bathches of 100,000 words each.

    Total time is 1.06 seconds, which includes subsequent merging of all five 100,000 word files, into one sorted file.

    English words (non scientific descriptions or made up "words" excluded), have a maximum length of 28 letters, by my definition of a word. (antidisestablishmentarianism)

    Very long strings definitely benefit by being sorted by only moving a pointer.
    But have you also coded it with just pointer swaps?
    And it doesn't seem like a proper test if you're including reading and writing to disk!
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Sorting strings within an array
    By tictac232434 in forum C Programming
    Replies: 7
    Last Post: 11-03-2012, 02:07 PM
  2. Help sorting my array of strings
    By Infamous in forum C Programming
    Replies: 19
    Last Post: 03-10-2011, 11:46 PM
  3. sorting an array of strings
    By porstart in forum C Programming
    Replies: 3
    Last Post: 02-22-2011, 10:43 PM
  4. Sorting an array of strings
    By porstart in forum C Programming
    Replies: 2
    Last Post: 11-21-2010, 09:46 PM
  5. Sorting strings in an array
    By JLan in forum C Programming
    Replies: 4
    Last Post: 11-29-2003, 05:10 PM

Tags for this Thread