Thread: sorting txt records

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    222

    sorting txt records

    Hi,


    I have a problem that involves large txt files. First, let me introduce the records:

    I have a large txt file with two columns. The first column is an integer id (there are 20 million of those) and the second is regular text (long). The whole file is about 10 GB in size.

    Example:

    1 kjfikjsdfgdkfng...
    2 jbsdfdbffdghdfhfg....
    3 dfgsdefgsdfgbdfg....
    4 sdfgdfgdsg.....


    Problem: I need to reorder the records (rows) according to my own key (ordering).

    I was thinking of the sqlite engine, but since the engine puts everything in RAM and I don't have that much memory, it's not an option. An additional requirement is that the sorting engine has to be part of my application.

    So is there a way to reorder such large records on a regular desktop machine?

    baxy

  2. #2
    Registered User
    Join Date
    Jul 2011
    Posts
    14
    Depending on the size of your RAM, you could use the method known as "external sorting" with a merge sort.
    This is probably the easiest method to implement:

    1) Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
    2) Write the sorted data to disk.
    3) Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (in the article's example of 900 MB of data, that is 900 MB / 100 MB = 9 chunks), which now need to be merged into one single output file.
    4) Read the first 10 MB (= 100MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
    5) Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally -- because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.

    from Wikipedia
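
    Here's a minimal C sketch of those two phases. It assumes fixed-size records for brevity (the records in this thread are variable-length lines, which need the boundary handling discussed in the next post), uses hypothetical chunk file names, and omits error handling:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define REC_SIZE   256                         /* assumed fixed record size */
    #define CHUNK_RECS (100u*1024*1024/REC_SIZE)   /* ~100 MB of records per chunk */

    static int cmp_rec(const void *a, const void *b)
    {
        return memcmp(a, b, REC_SIZE);   /* stand-in for the real key comparison */
    }

    /* Phase 1: split the input into sorted chunk files chunk0, chunk1, ... */
    static int make_runs(FILE *in)
    {
        char *buf = malloc((size_t)CHUNK_RECS * REC_SIZE);
        int nchunks = 0;
        size_t got;

        while ((got = fread(buf, REC_SIZE, CHUNK_RECS, in)) > 0) {
            char name[32];
            FILE *out;

            qsort(buf, got, REC_SIZE, cmp_rec);
            snprintf(name, sizeof name, "chunk%d", nchunks++);
            out = fopen(name, "wb");
            fwrite(buf, REC_SIZE, got, out);
            fclose(out);
        }
        free(buf);
        return nchunks;
    }

    /* Phase 2: k-way merge of the sorted chunks into the output file.
     * One record of look-ahead per chunk; stdio buffering stands in for
     * the large input buffers described in the steps above. */
    static void merge_runs(int nchunks, FILE *out)
    {
        FILE **run  = malloc((size_t)nchunks * sizeof *run);
        char  *rec  = malloc((size_t)nchunks * REC_SIZE);
        int   *live = malloc((size_t)nchunks * sizeof *live);
        int i;

        for (i = 0; i < nchunks; i++) {
            char name[32];
            snprintf(name, sizeof name, "chunk%d", i);
            run[i]  = fopen(name, "rb");
            live[i] = fread(rec + (size_t)i * REC_SIZE, REC_SIZE, 1, run[i]) == 1;
        }
        for (;;) {
            int best = -1;
            for (i = 0; i < nchunks; i++)    /* linear scan; fine for small k */
                if (live[i] && (best < 0 ||
                    cmp_rec(rec + (size_t)i * REC_SIZE,
                            rec + (size_t)best * REC_SIZE) < 0))
                    best = i;
            if (best < 0)
                break;                       /* all chunks exhausted */
            fwrite(rec + (size_t)best * REC_SIZE, REC_SIZE, 1, out);
            live[best] = fread(rec + (size_t)best * REC_SIZE,
                               REC_SIZE, 1, run[best]) == 1;
        }
        for (i = 0; i < nchunks; i++)
            fclose(run[i]);
        free(run); free(rec); free(live);
    }

    int main(void)
    {
        FILE *in  = fopen("input.bin",  "rb");   /* hypothetical file names */
        FILE *out = fopen("sorted.bin", "wb");

        merge_runs(make_runs(in), out);
        fclose(in);
        fclose(out);
        return 0;
    }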
    Last edited by kpark91; 08-16-2013 at 08:51 AM. Reason: Numbering of steps

  3. #3
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    A 9-way or k-way merge with k > 2 is probably more than what is needed for your program. A standard 2-way merge sort will be enough to sort your data in a reasonable amount of time.

    As the wiki article mentions, reading multiple records per I/O will save a lot of time, but system buffering and disk drive caching will help here even if you read one line at a time. If this turns out to be too slow, you'll need to read the file in larger chunks and handle line boundaries in your own code.

    To deal with variable-length records with a known maximum size, read each chunk in at an offset of buffer + maximum_record_size, rounded up to some multiple of 4, 8, 16, or 32 bytes to keep the reads on a friendly buffer boundary. You'll have a partial record at the end of the read chunk, which you can move to just below the read boundary so that it lines up with the rest of the record on the next read.
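
    A minimal sketch of that carry-over technique, with assumed chunk and record sizes; process_lines() is a hypothetical handler standing in for the real parsing, and every record is assumed to fit within MAX_REC bytes:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_REC 4096u                /* assumed max record size, rounded up */
    #define CHUNK   (8u*1024*1024)       /* 8 MB per read */

    /* Stub: a real handler would parse and sort the lines in data[0..len). */
    static void process_lines(const char *data, size_t len)
    {
        fwrite(data, 1, len, stdout);
    }

    static void read_in_chunks(FILE *fp)
    {
        char  *buf   = malloc(MAX_REC + CHUNK);
        size_t carry = 0;                /* partial record carried between reads */
        size_t got;

        while ((got = fread(buf + MAX_REC, 1, CHUNK, fp)) > 0) {
            char  *data  = buf + MAX_REC - carry;
            size_t total = carry + got;
            size_t end   = total;

            /* Back up to the last newline; bytes after it are a partial record. */
            while (end > 0 && data[end - 1] != '\n')
                end--;

            process_lines(data, end);

            /* Park the tail just below the read boundary so it lines up
             * with the rest of the record on the next read. */
            carry = total - end;
            memmove(buf + MAX_REC - carry, data + end, carry);
        }
        if (carry)                       /* final record without a newline */
            process_lines(buf + MAX_REC - carry, carry);
        free(buf);
    }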

    If max record size is less than 256 bytes, then you can replace the newlines with record lengths to speed up the sort. Again, you'll need to use an offset in the read buffer in order to leave space to prefix the first record with a record length byte.
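
    A sketch of that in-place conversion, assuming the chunk was read with one spare byte left free before the first record and that every record ends in '\n':
    Code:
    /* Replace newline terminators with length-byte prefixes, in place.
     * buf[0] is the spare prefix byte; buf[1..len] holds the text as read,
     * with every record (content at most 255 bytes) ending in '\n'.
     * Afterwards: at slot p, n = buf[p] is the content length, the content
     * is buf[p+1..p+n], and the next slot is at p+n+1; the final '\n' slot
     * is past the last record, so walk the buffer by total length. */
    static void length_prefix(unsigned char *buf, size_t len)
    {
        size_t i, slot = 0;              /* index of the current length byte */

        for (i = 1; i <= len; i++)
            if (buf[i] == '\n') {
                buf[slot] = (unsigned char)(i - slot - 1);  /* content length */
                slot = i;                /* this '\n' becomes the next slot */
            }
    }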

    You may want to use an array of pointers to the records and sort those, then rearrange the data using the pointers, but on my system(s), this isn't faster until the record size is about 128 to 256 bytes, depending on whether pointers are 32 or 64 bits.

    You didn't mention where the key data is stored. Is this part of each record in the text file or a separate file?
    Last edited by rcgldr; 08-16-2013 at 10:07 AM.

  4. #4
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    When I needed to code up an external merge sort, I used Quicksort to handle the initial sorting of the data, then a two-way merge sort. My data file was much smaller (but the system was DOS-based, so there were many small sorted data files to be merged).

    I was quite pleased with the speed of the sorting.


    Some questions for you, if you don't mind.

    What are your expectations for the run-time, and is it critical?
    What run-time have you had sorting this data before?
    Would you be able to post a 100 MB sample of the data and key, so we can see just what we're talking about?
    (If so, please post the file to a file depot, and just post the link to it here.) It doesn't need to be real data, just something for testing purposes.
    Something like Swoopshare, where no registration is required:
    swoopshare - Save files on the web

    The file size limit is 500 MB.

  5. #5
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    OK, maybe I wasn't clear enough. I am not trying to sort my records numerically or lexicographically. I have my own ordering that needs to be applied, and it has no structure. For example, let's say I have an array of numbers 1,5,2,6,8,15,12,35,6,7,18,9,10 and I need to order them according to the following key:

    value position
    1 7
    5 2
    6 3
    8 4
    15 1
    12 5
    35 8
    16 6
    7 11
    18 9
    9 10
    10 12


    What comes to my mind is hashing and printing, or using engines like sqlite or mysql to do the job, but since a large text record is tied to each value, putting everything into memory is not an option. Now that you have mentioned it, I could apply hashing, ordering and merging.


    I know this is a C forum, but you can use this perl one-liner to generate the data:

    Code:
     perl -e '@a = (a..z);for(1..50){print "$_\t"; for(1..100){print $a[rand @a]}; print "\n"}'
    and you can use any random order of the numbers 1-50 as your key (the numbers should not be duplicated)

    By changing 50 to 1000 you increase the number of records, and by changing 100 to something larger you increase the size of an individual record.

    baxy

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    You were clear earlier - you want to sort the data by your own key. That isn't a problem, as long as the program can know the order of the keys before the actual sorting takes place.

    I'll post a simple example of this in a follow-up post, a little later. Perl is not a problem.

  7. #7
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by baxy View Post
    I have my own order that needs to be implemented and that has no structure. Example let say I have an array of numbers 1,5,2,6,8,15,12,35,6,7,18,9,10 and i need to order them according to the following key ...
    Then treat a record and its associated key as a single element, and sort the elements by the key, using any sort method you want. For a hard drive sort, a merge sort will be fastest, since it reduces random access. Reading and writing multiple elements per read or write command (this works well with merge sort) will further speed up the process.

    What you don't want to do is sort the keys, and then try to access a large file according to key order, since that will result in one random access per record retrieved, which is very slow compared to the streaming rate of a hard drive.
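
    A minimal sketch of that pairing, with an assumed maximum record size and illustrative names; the same comparison would serve both the initial in-memory sort and the merge passes:
    Code:
    #include <stdlib.h>

    typedef struct {
        long position;        /* desired output position, taken from the key */
        char text[256];       /* assumed maximum record size */
    } element_t;

    static int cmp_element(const void *a, const void *b)
    {
        long pa = ((const element_t *)a)->position;
        long pb = ((const element_t *)b)->position;
        return (pa > pb) - (pa < pb);    /* avoids overflow, unlike pa - pb */
    }

    /* In-memory phase: sort one chunk of elements by their key position. */
    static void sort_elements(element_t *e, size_t n)
    {
        qsort(e, n, sizeof *e, cmp_element);
    }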

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    A simple example of sorting using an external key. As rcgldr noted, it can also be done with a key which is attached to each data item.

    Because the amount of data is quite large, I would use an external key, in a file.

    Code:
    #include <stdio.h>
    
    void printIt(int *a,int size);
    
    int main(void) {
       int i, j, temp;
       int a[]={3,8,2,0,4,1,6,9,5,7};
       
       int key[10]={0,2,4,6,8,9,7,5,3,1};  
       //desired order is: {0,9,1,8,2,7,3,6,4,5};
    
       printf("Original data order is:\n");
       printIt(a,10); //original order
       printf("My desired key order is: 0,9,1,8,2,7,3,6,4,5\n\n");
    
       //sort via Substitution sort - just for this demo
       for(i=0;i<10-1;i++) {
          for(j=i+1;j<10;j++) {
             if(key[a[i]] > key[a[j]]) {
                temp = a[i];
                a[i] = a[j];
                a[j] = temp;
             }
          }
       }
       printf("Data order after sorting, is: \n");
       printIt(a, 10); //as sorted with my key
       printf("Which matches the desired key order.\n\n");
       return 0;
    }
    
    void printIt(int *a, int size) {
       int i;
       
       for(i=0;i<size;i++)
          printf("%2d ", a[i]);
       
       printf("\n\n");
    }

  9. #9
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    If your desktop machine is a 64-bit Linux (or Mac OS X), you can use memory mapping, and let the kernel worry about which parts of the input file to keep in memory. For example:
    Code:
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <string.h>
    #include <stdio.h>
    
    typedef struct {
        const char *ptr;
        size_t      len;
        /* Sort key? */
        long        keyval;
        const char *keyptr;
        size_t      keylen;
    } line_t;
    
    typedef struct {
        int         map_desc;   /* File descriptor */
        size_t      map_size;   /* Map size */
        size_t      map_used;   /* File size */
        const char *map_data;   /* Memory-mapped data */
        size_t      lines;      /* Number of lines */
        line_t     *line;       /* Array of lines, includes newlines */
    } file_t;
    
    
    /* Compare two lines.
     * This sorts by keyval, then case insensitively at keyptr.
    */
    static int compare_lines(const void *left, const void *right)
    {
        const long   val1 = ((const line_t *const)left)->keyval;
        const char  *key1 = ((const line_t *const)left)->keyptr;
        const size_t len1 = ((const line_t *const)left)->keylen;
        const long   val2 = ((const line_t *const)right)->keyval;
        const char  *key2 = ((const line_t *const)right)->keyptr;
        const size_t len2 = ((const line_t *const)right)->keylen;
    
        if (val1 < val2)
            return -1;
        else
        if (val1 > val2)
            return +1;
    
        if (len1 < len2)
            return (strncasecmp(key1, key2, len1) <= 0) ? -1 : +1;
        else
        if (len1 > len2)
            return (strncasecmp(key1, key2, len2) < 0) ? -1 : +1;
        else
            return strncasecmp(key1, key2, len1);
    }
    
    
    /* Define a line.
    */
    static void define_line(line_t *const line, const char *ptr, const char *const end)
    {
        long val = 0L;
        int  negative = 0;
    
        line->ptr = ptr;
        line->len = (size_t)(end - ptr);
    
        /* Skip leading whitespace. */
        while (ptr < end && (*ptr == '\t' || *ptr == ' ')) ptr++;
    
        /* Parse signs. */
        while (ptr < end && (*ptr == '+' || *ptr == '-'))
            if (*(ptr++) == '-')
                negative = !negative;
    
        /* Parse numeric value. */
        while (ptr < end && *ptr >= '0' && *ptr <= '9')
            val = 10L * val + (long)(*(ptr++) - '0');
    
        /* Skip whitespace. */
        while (ptr < end && (*ptr == '\t' || *ptr == ' ')) ptr++;
    
        /* The second token is the key. */
        line->keyval = (negative) ? -val : val;
        line->keyptr = ptr;
        line->keylen = (size_t)(end - ptr);
    }
    
    static int get_lines(const char *const data,
                         const size_t      size,
                         line_t    **const arrayptr,
                         size_t     *const countptr)
    {
        const char       *cur = data;
        const char       *ptr;
        const char *const end = data + size;
    
        line_t *line = NULL;
        size_t  lines = 0;
        size_t  lines_max = 0;
    
        if (!arrayptr || !countptr)
            return errno = EINVAL;
    
        while (cur < end) {
    
            /* Find end of this line. */
            ptr = cur;
            while (ptr < end && (*ptr != '\n' && *ptr != '\r'))
                ptr++;
            if (ptr < end) {
                if (*ptr == '\n') {
                    ptr++;
                    if (ptr < end && *ptr == '\r')
                        ptr++;
                } else
                if (*ptr == '\r') {
                    ptr++;
                    if (ptr < end && *ptr == '\n')
                        ptr++;
                }
            }
    
            /* Allocate more lines if necessary. */
            if (lines >= lines_max) {
                line_t *const old = line;
    
                lines_max = (lines | 65535) + 65537;
                line = realloc(old, lines_max * sizeof *line);
                if (!line) {
                    free(old);
                    return errno = ENOMEM;
                }
            }
    
            define_line(&line[lines], cur, ptr);
            lines++;
            cur = ptr;
        }
    
        /* Optimize line array. */
        if (!lines) {
            free(line);
            line = NULL;
            lines_max = 0;
        } else
        if (lines != lines_max) {
            line_t *temp;
            temp = realloc(line, lines * sizeof *line);
            if (temp) {
                line = temp;
                lines_max = lines;
            }
        }
    
        *arrayptr = line;
        *countptr = lines;
        return 0;
    }
                  
    
    int open_input(file_t *const file, const char *const name)
    {
        struct stat info;
        const char *data;
        size_t length;
        long pagesize;
        int fd, result;
    
        if (!file || !name || !*name)
            return errno = EINVAL;
    
        file->map_data = MAP_FAILED;
        file->map_used = 0;
        file->map_size = 0;
    
        file->lines = 0;
        file->line = NULL;
    
        pagesize = sysconf(_SC_PAGE_SIZE);
        if (pagesize < 1L)
            return errno = ENOTSUP;
    
        do {
            fd = open(name, O_RDONLY | O_NOCTTY);
        } while (fd == -1 && errno == EINTR);
        if (fd == -1)
            return errno;
    
        if (fstat(fd, &info)) {
            do {
                result = close(fd);
            } while (result == -1 && errno == EINTR);
            return errno = EPERM;
        }
    
        /* Length must be a positive multiple of page size. */
        if (!info.st_size || (long)info.st_size % pagesize)
            length = (size_t)info.st_size + (size_t)pagesize - (size_t)((long)info.st_size % pagesize);
        else
            length = (size_t)info.st_size;
    
        data = mmap(NULL, length, PROT_READ, MAP_SHARED | MAP_NORESERVE, fd, 0);
        if (data == MAP_FAILED) {
            const int cause = errno;
            do {
                result = close(fd);
            } while (result == -1 && errno == EINTR);
            return errno = cause;
        }
    
        /* We will be accessing the file linearly. */
        posix_fadvise(fd, 0, info.st_size, POSIX_FADV_SEQUENTIAL);
        posix_madvise((void *)data, length, POSIX_MADV_SEQUENTIAL);
    
        if (get_lines(data, (size_t)info.st_size, &file->line, &file->lines)) {
            munmap((void *)data, length);
            do {
                result = close(fd);
            } while (result == -1 && errno == EINTR);
            return errno = ENOMEM;
        }
    
        /* From this point forwards, the file and map access will be random. */
        posix_fadvise(fd, 0, info.st_size, POSIX_FADV_RANDOM);
        posix_madvise((void *)data, length, POSIX_MADV_RANDOM);
    
        file->map_desc = fd;
        file->map_data = data;
        file->map_size = length;
        file->map_used = (size_t)info.st_size;
    
        return 0;
    }            
    
    void close_input(file_t *const file)
    {
        if (file) {
            int result;
    
            if (file->map_data != MAP_FAILED && file->map_size)
                munmap((void *)file->map_data, file->map_size);
    
            if (file->map_desc != -1)
                do {
                    result = close(file->map_desc);
                } while (result == -1 && errno == EINTR);
    
            free(file->line);
    
            file->map_desc = -1;
            file->map_data = MAP_FAILED;
            file->map_size = 0;
            file->map_used = 0;
            file->lines = 0;
            file->line = NULL;
        }
    }
    
    /* create: -1: Never, file must exist
     *          0: Create if necessary
     *         +1: Always create
    */
    int open_output(const char *const filename, const int create)
    {
        int fd;
    
        if (!filename || !*filename) {
            errno = EINVAL;
            return -1;
        }
    
        if (create < 0)
            do {
                fd = open(filename, O_RDWR | O_NOCTTY);
            } while (fd == -1 && errno == EINTR);
        else
        if (create > 0)
            do {
                fd = open(filename, O_RDWR | O_CREAT | O_EXCL, 0666);
            } while (fd == -1 && errno == EINTR);
        else
            do {
                fd = open(filename, O_RDWR | O_CREAT | O_NOCTTY, 0666);
            } while (fd == -1 && errno == EINTR);
    
        return fd;
    }
    
    int close_output(const int fd)
    {
        int result;
    
        if (fd == -1)
            return 0;
    
        do {
            result = close(fd);
        } while (result == -1 && errno == EINTR);
        if (result == -1)
            return errno;
    
        return 0;
    }
    
    int write_all(const file_t *const file, const int to_fd)
    {
        if (!file || to_fd == -1)
            return errno = EINVAL;
    
        /* Reset file position. Ignore errors (in case to_fd is not a file). */
        lseek(to_fd, 0, SEEK_SET);
    
        /* Resize file to final size. Ignore errors. */
        if (ftruncate(to_fd, (off_t)file->map_used)) {
            /* Ignore errors */
        }
    
        /* Preallocate file. Ignore errors. */
        posix_fallocate(to_fd, (off_t)0, (off_t)file->map_used);
    
        /* Write the lines. */
        {
            const size_t        lines = file->lines;
            const line_t *const line = file->line;
            size_t              i;
    
            for (i = 0; i < lines; i++) {
                const char       *p = line[i].ptr;
                const char *const q = line[i].ptr + line[i].len;
                ssize_t                    n;
    
                while (p < q) {
                    n = write(to_fd, p, (size_t)(q - p));
                    if (n > (ssize_t)0)
                        p += n;
                    else
                    if (n != (ssize_t)-1)
                        return errno = EIO;
                    else
                    if (errno != EINTR)
                        return errno;
                }
    
                if (p != q)
                    return errno = EIO; /* Should never happen. */
            }
        }
    
        /* Success. */
        return 0;
    }
    
    int main(int argc, char *argv[])
    {
        file_t input;
        const char *outname;
        int output, result;
    
        if (argc < 2 || argc > 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
            fprintf(stderr, "\n");
            fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
            fprintf(stderr, "       %s INPUT-FILE [ OUTPUT-FILE ]\n", argv[0]);
            fprintf(stderr, "\n");
            return 1;
        }
    
        if (argc >= 3) {
            outname = argv[2];
            output = open_output(outname, +1); /* Always create */
            if (output == -1) {
            fprintf(stderr, "%s: %s.\n", outname, strerror(errno));
                return 1;
            }
        } else {
            output = STDOUT_FILENO;
            outname = NULL;
        }
    
        if (open_input(&input, argv[1])) {
            fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
            if (outname) {
                unlink(outname);
                do {
                    result = close(output);
                } while (result == -1 && errno == EINTR);
            }
            return 1;
        }
    
        fprintf(stderr, "Read %lu lines.\nSorting .. ", (unsigned long)input.lines);
        fflush(stderr);
    
        qsort(input.line, input.lines, sizeof input.line[0], compare_lines);
    
        fprintf(stderr, "Done.\nSaving .. ");
        fflush(stderr);
    
        if (write_all(&input, output)) {
            if (outname) {
                fprintf(stderr, "%s: %s.\n", outname, strerror(errno));
                unlink(outname);
                do {
                    result = close(output);
                } while (result == -1 && errno == EINTR);
            } else
                fprintf(stderr, "Error writing to standard output: %s.\n", strerror(errno));
            return 1;
        }
    
        if (close_output(output)) {
            if (outname) {
                fprintf(stderr, "%s: Write error: %s.\n", outname, strerror(errno));
                unlink(outname);
            } else
                fprintf(stderr, "Error writing to standard output: %s.\n", strerror(errno));
            return 1;
        }
    
        fprintf(stderr, "Done.\n");
        close_input(&input);
        return 0;
    }
    The above code defines the numeric value of the initial field in keyval, and the start and length of the rest of the fields on that line in keyptr and keylen; see the define_line() function.

    When all lines have been scanned (read once), main() has input.lines lines in the input.line array. They are read-only, but you can access them as if they were in memory. (In Linux, the MAP_NORESERVE flag means no swap space is reserved for the mapping, which is why it's important to keep the mapping read-only.)

    The above example main() uses compare_lines() and qsort() to sort the input.line array.

    For string comparisons, where the contents of each line must be examined, using some form of a (balanced) binary tree instead of a linear array tends to be more efficient. That also interleaves the reads (from slow storage) with CPU work (examining the line contents), which means the operation completes in less time (wall-clock wise; it does tend to use more CPU). It is what I personally would do; I just wanted to keep this example straightforward.

  10. #10
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Some other questions:

    1. How often do you need to sort this data (e.g. is this something you need to do just once, or are you regularly sorting data)?

    2. How quickly do you need the data to be sorted?

    3. Presumably, you have a need to access this data after sorting it - in that case, a gigantic text file may be a poor long-term storage option for other reasons.

  11. #11
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    The ratio of random access time versus data transfer time on a hard drive needs to be taken into account. Assume a streaming rate of 50 megabytes/second and a combined record + key size of 250 bytes per text line: sequential transfer moves 200,000 lines per second, while random access is going to be fewer than 200 accesses per second. That's a ratio of 1000 to 1. Even if the sort took 25 passes of reading and writing the entire file (50 sequential transfers of the whole file), it would still be roughly 20 times as fast as making one read-only random-access pass (say via a sorted key) over that same file.

    This problem would be better solved by doing a conventional external merge sort of the data, using large I/O, like transferring 10MB to 100MB per read or write call.

  12. #12
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    The ratio of random access time versus data transfer time to a hard drive needs to be taken into account.
    If you are responding to my suggestion of using a memory-map, you're completely forgetting how the kernel manages the page cache.

    If the input lines are immediately placed into some form of a (balanced) binary tree, the nodes near the root of the tree will stay hot (in cache). Furthermore, you do the comparison work concurrently with the disk I/O.

    Quote Originally Posted by rcgldr View Post
    This problem would be better solved by doing a conventional external merge sort of the data, using large I/O, like transferring 10MB to 100MB per read or write call.
    Compare the two methods (using, say, a red-black tree), and you'll find you're wrong in this particular case, assuming you compare them using wall clock time.

    Using a conventional external merge sort is useful when RAM is very limited. A 10GB dataset is not that large for a typical 64-bit workstation; something like maybe three times the available RAM. That is still within the capability of the kernel to manage efficiently, in my experience.

    The main difference is being able to sort the data concurrently to I/O. A conventional external merge sort does not do that.

    For limited-size data sets, especially those that fit into available RAM, or are just a few times larger than the available RAM, the memory-mapping technique is simple, easy to implement, and tends to beat any sort that does I/O first, then sort, then I/O, and so on.

    In particular, the memory-mapping technique is simple enough for new C programmers to understand and work with: it just maps the data into memory. The only things you need to remember are that you cannot modify the data, and you cannot assume there is an extra NUL ('\0', end-of-string mark) at the end of the data. The kernel handles the rest.

    A multithreaded external merge sort does perform better, though. (DO NOT try to use multiple threads to sort a single block, as that just causes cacheline ping-pong and unreadable code. I mean multiple threads so that there is always at least one thread doing I/O.) These have perfectly sequential access patterns, and the kernel can be advised to provide best throughput for that wrt. memory usage. The main reason for using multiple threads is to keep I/O ongoing at all times, possibly in more than one stream.

    The number of threads used is not related to the number of CPU cores on the machine, but to the optimum number of concurrent I/O streams on the specific machine. Even on a desktop workstation with a single non-RAID disk, that is not necessarily "1". (It might be in Windows, but I don't use Windows; on Linux and FreeBSD, it is usually 2, or more if using RAID; I warmly recommend software-RAID 0/1 on a workstation.) On my workstation with a two-disk RAID, four tends to yield best results, even with other typical-desktop-I/O use at the same time.

    I could show a POSIX implementation of such here, if there is real interest?
    Compared to the memory-mapping example, it is obviously much more complicated, and therefore further out of reach for new programmers.

  13. #13
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    I could show a POSIX implementation of such here, if there is real interest?
    We need to wait for feedback from the original poster. It's not clear whether the keys can be paired up with the data in the text file to form combined records that can then be sorted.

    Since it is a 10GB file, if he's running a 64-bit operating system, he could consider upgrading to 16GB or 32GB of RAM. Another option would be to use an SSD, in which case there's almost no random access penalty, allowing a bigger choice of algorithms. In this thread, there's an algorithm that uses a list of sorted indexes to sort the data the indexes point to (it's near the end of the thread). I don't know how efficient it is, but it's minimal code (a for loop with 4 lines of code).

    Is there a name for this sort algorithm?
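
    For reference, a sketch of the kind of loop that seems to be meant, where idx[i] is the destination of data[i] and both arrays are permuted in place by following cycles; I don't know of a standard name for this exact loop, though cycle sort is a close relative:
    Code:
    #include <stddef.h>

    /* Rearrange data in place so that data[i] ends up at position idx[i].
     * idx must be a permutation of 0..n-1; it ends up as the identity. */
    static void apply_permutation(int *data, size_t *idx, size_t n)
    {
        size_t i, t;
        int d;

        for (i = 0; i < n; i++)
            while (idx[i] != i) {
                d = data[i]; data[i] = data[idx[i]]; data[idx[i]] = d;
                t = idx[i];  idx[i]  = idx[t];       idx[t]       = t;
            }
    }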
    Last edited by rcgldr; 08-18-2013 at 02:57 AM.

  14. #14
    Registered User
    Join Date
    Jan 2011
    Posts
    222

    1. How often do you need to sort this data (e.g. is this something you need to do just once, or are you regularly sorting data)?

    2. How quickly do you need the data to be sorted?

    3. Presumably, you have a need to access this data after sorting it - in that case, a gigantic text file may be a poor long-term storage option for other reasons.
    1. I need to sort this data once, before processing it further.

    2. I believe that speed is not a real issue at this point; however, the faster the better.

    3. Yes, I know, but this is a standard data format designed that way for portability, and I do not want to invent another "better" standard.

    thnx

  15. #15
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    Quote Originally Posted by rcgldr View Post
    We need to wait for feedback from the original poster. It's not clear whether the keys can be paired up with the data in the text file to form combined records that can then be sorted.
    Well, the keys can be paired but need to be removed afterwards, and I don't think that is a problem to do. Unfortunately I have only 8GB of memory and $0 at my disposal for an upgrade. But I do have an SSD in my Linux machine.

    PS
    Thank you all for your input. This thread was really nice reading material and extremely helpful.

    thank you !!!

    baxy
