Thread: sort file

  1. #1
    Registered User
    Join Date
    May 2011
    Posts
    116

    sort file

    hello everyone!
    I have a file which has records with 2 fields--one int and one float---
    what is the best way to sort it?
    The sorting has to be based on the int field(the first field)
    (each time the program runs the file might end up with some hundreds of records)
    thanks in advance!

  2. #2
    Stoned Witch Barney McGrew's Avatar
    Join Date
    Oct 2012
    Location
    astaylea
    Posts
    420
    Store each record into a struct and insert it in an ordered binary tree, then write each record in order.

    edit: Or just use "sort -n".

  3. #3
    Registered User
    Join Date
    May 2011
    Posts
    116
    Quote Originally Posted by Barney McGrew View Post
    Store each record into a struct and insert it in an ordered binary tree, then write each record in order.

    edit: Or just use "sort -n".
    do I have to use a bash script for this?
    also if I want the sorting to be according the second field (the float) how can I achieve that with one command?

  4. #4
    Stoned Witch Barney McGrew's Avatar
    Join Date
    Oct 2012
    Location
    astaylea
    Posts
    420
    do I have to use a bash script for this?
    Nah. My first suggestion involves writing a C program and my second one involves using a program that does a similar thing in a more generic way. You don't need to write a bash script for either, but you could involve one in both cases if you wanted.

    also if I want the sorting to be according the second field (the float) how can I achieve that with one command?
    "sort -n -k 2" should work.

  5. #5
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    While the idea of Barney seems nice, you might don't have a tree ready.

    In that case, I would suggest you to retrieve the data in an array of structs (O(n)), sort the array (O(nlogn)) -using mergesort or quicksort (which comes from stdlib.h ready to use in qsort and finally put the data back in file (O(n)), so the time complexity is O(nlogn), which is fine for such operations.

    Notice, that in case you use qsort, then you could have two comparison functions. One for sorting based on the first field and one for sorting based on the 2nd field. Then you could have a boolean flag (can be done with an integer variable, by using 0 or 1, if you don't want to include the relative header file), which would choose whether to call qsort with the first comparison function (thus sorts based on the 1st field) or to call qsort with the first comparison function (thus sorts based on the 2nd field).

    With Barney's way, the time complexity would be the same, O(nlogn), since insertion in a balanced tree take O(logn).

    Oh and by the way you can use this function of mine to copy files, but of course the source and target file would be the same in your case (I suppose).

    Hope that helps!
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  6. #6
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    Quote Originally Posted by std10093 View Post
    While the idea of Barney seems nice, you might don't have a tree ready.
    If you want a ready-to-use tree, FreeBSD provides the header file tree.h

    You just need to download the files tree.h and cdefs.h. One place to get them is here: [base] Index of /user/attilio/membarclean/sys

    There are also manpages online with example of how to use the functions. It requires more boilerplate code than using qsort(3) but it can have speed advantages

  7. #7
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    Quote Originally Posted by c99tutorial View Post
    It requires more boilerplate code than using qsort(3) but it can have speed advantages
    The substring of the above string separated by the "but" is true :P. The other needs elaboration..........
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  8. #8
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    Quote Originally Posted by std10093 View Post
    The substring of the above string separated by the "but" is true :P. The other needs elaboration..........
    Using qsort to sort input lines works fine for me, but when I do the same thing using RB_INSERT from tree.h, the result executes about 25% faster. This is with what I hope is the simplest usage possible to accomplish the problem of sorting lines in the input. If you're curious you can try the sample code for yourself

    Private Misc: FreeBSD tree.h Example

    Now compare to the case where you read all the lines into an array and use qsort on it. The result is the same albeit with slower performance.

  9. #9
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    When it comes to using trees for sorting, Splay trees are the most awesome, and beat any other tree type for typical read-world data.

    what is the best way to sort it?
    The best way is to spend most of your time defining what you mean by "best", and only then actually implementing something. Here are a few criterion for you:

    Debugable.
    Ease of understanding.
    Standards conforming.
    Portability.
    Stability.
    Scalability.
    Average Speed.
    Worst case speed.
    Speed for small N.
    Memory usage.
    Code length.
    Genericity.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  10. #10
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    These tree's are not the same as the heaps used in Heapsort?

  11. #11
    Stoned Witch Barney McGrew's Avatar
    Join Date
    Oct 2012
    Location
    astaylea
    Posts
    420
    It's probably easier just to use a simple binary tree since they're very easy to implement and they're ideal for this particular task. With an array you need to make define a maximum limit (unless you implement a growing array, which can be inefficient if you need to shift the data around every time you do a reallocation) and you can't directly insert elements in order into it without difficulty, so you need to store each element, unordered, into it before running a function that will sort it. And with a linked list it usually takes much longer to insert ordered data into it, especially for larger data sets. I wrote a similar program (Ideone.com | Online C99 Compiler & Debugging Tool) using a binary tree that sorts lines in files. It's only around 60 lines of code, and I'd say it's easier to implement than some of the more complicated sorting algorithms.

    I don't really think it's necessary to use the more specialised, and harder to implement, data structures unless you're sure that it will fix a significant bottle-necks in your code. In any case, it's always important to profile your code after you get it running well.

  12. #12
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Barney, did you ever test it against qsort(), etc.? The reason I ask is that we do get inquiries from people needing massive sorting jobs done. REALLY BIG jobs!

  13. #13
    Stoned Witch Barney McGrew's Avatar
    Join Date
    Oct 2012
    Location
    astaylea
    Posts
    420
    I haven't currently run any tests on it, but the performance of the method you suggest varies based on a few factors:

    - How your C library implements qsort. C doesn't specify which sorting algorithm must be used, the choice is up to the implementation. The results may be significantly different if one implementation uses insertion sort and another uses quick sort.
    - How you choose to allocate memory for the array and store it. This isn't likely to be significant performance-wise, but it's an extra problem to worry about.

    And there's also the amount of data as well as the order it's presented in, which is likely to vary regardless of what algorithms/data structures you decide to use.

    With that said, I decided to compare it against the following code:

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #define MAX_LINES 6000000
    
    static char *dup(const char *s)
    {
        char *r;
    
        if (!(r = malloc(strlen(s) + 1)))
            exit(EXIT_FAILURE);
        return strcpy(r, s);
    }
    
    static int compar(const void *a, const void *b)
    {
        return strcmp(*(char *const *) a, *(char *const *) b);
    }
    
    static void print(char *const *s, size_t n)
    {
        while (n--)
            fputs(*s++, stdout);
    }
    
    int main(void)
    {
        static char *a[MAX_LINES];
        char buf[8192];
        size_t i = 0;
    
        while (fgets(buf, sizeof buf, stdin) && i < sizeof a / sizeof a[0])
            a[i++] = dup(buf);
        qsort(a, i, sizeof a[0], compar);
        print(a, i);
        return 0;
    }
    and I had the following results (btree.c is the code I linked in my previous post, and array.c is the code pasted above):

    ~ $ gcc -std=c99 -march=native -Wall -pedantic array.c -o array
    ~ $ gcc -std=c99 -march=native -Wall -pedantic btree.c -o btree
    ~ $ time ./array < test > tsta

    real 0m5.601s
    user 0m5.111s
    sys 0m0.210s
    ~ $ time ./btree < test > tstb

    real 0m3.616s
    user 0m3.197s
    sys 0m0.289s
    ~ $ diff tsta tstb
    ~ $ time ./array < test > /dev/null

    real 0m5.154s
    user 0m5.025s
    sys 0m0.117s
    ~ $ time ./btree < test > /dev/null

    real 0m3.369s
    user 0m3.200s
    sys 0m0.163s
    Note that those were compiled without optimisations, and the file test is basically just the plaintext version of the C89 standard catted 360 times (5,129,640 lines, with each line < 80 characters.) The array version has shorter and perhaps simpler code, but it comes at the price of a fixed maximum number of lines. The times made by both programs are fairly good, at least in comparison with other methods, but I'd still say that a binary tree is the most practical choice in this case for its ease of implementation, insertion, and speed.
    Last edited by Barney McGrew; 02-17-2013 at 05:24 AM.

  14. #14
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Interesting. Tree's have a distinct advantage whenever the number of items being worked on, is unknown. Good run times, also!

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Sort my file
    By Gil Carvalho in forum C Programming
    Replies: 13
    Last Post: 06-21-2012, 01:32 PM
  2. File I/O with sort
    By yigster in forum C Programming
    Replies: 2
    Last Post: 06-08-2010, 08:29 PM
  3. Sort a txt File
    By Achilles in forum C++ Programming
    Replies: 5
    Last Post: 02-19-2003, 06:39 AM
  4. Shell Sort vs Heap Sort vs Quick Sort
    By mackol in forum C Programming
    Replies: 6
    Last Post: 11-22-2002, 08:05 PM
  5. VC++ Sort File
    By Duncan Booth in forum C++ Programming
    Replies: 1
    Last Post: 05-27-2002, 12:06 PM