Several algorithms for fast sorting of strings in internal memory have been described in recent years,
including improvements to the standard quicksort (Bentley and McIlroy, 1993), multikey quicksort (Bentley
and Sedgewick, 1997), and variants of radix sort (Andersson and Nilsson, 1998). The fastest such
algorithm is our burstsort (Sinha and Zobel, 2004a), in which strings are processed depth-first rather than
breadth-first and a complete trie of common prefixes is used to distribute strings amongst buckets. In all of
these methods, the strings themselves are left in place, and only pointers to them are moved into sort
order. This pointerized approach to string sorting has become a standard solution to the problems posed
by the length and variability of strings, as expressed by Andersson and Nilsson (1998):
"to implement a string sorting algorithm efficiently we should not move the strings themselves
but only pointers to them. In this way each string movement is guaranteed to take constant
time."
Under this argument, moving the variable-length strings in the course of sorting is considered inefficient:
sorting operates on an array of pointers to the strings, and at the completion of sorting only the pointers
need to be in sort order.
However, evolutionary developments in computer architecture mean that the efficiency of sorting strings
via pointers needs to be reexamined. Use of pointers is efficient if they are smaller than strings and if the
costs of accessing and copying strings and pointers are uniform, but the speed of memory access has fallen
behind processor speeds and most computers now have significant memory latency. Each string access
can potentially result in a cache miss; that is, if a collection is too large to fit in cache memory, the cost
of out-of-cache references may dominate other factors, making it worthwhile to move even long strings in
order to increase their spatial locality. It is the cost of accessing strings via pointers that gives burstsort
(which uses no fewer instructions than other methods) a speed advantage.