Thread: File I/O in large data file

  1. #1
    Registered User
    Join Date
    Mar 2007
    Posts
    9

    File I/O in large data file

    I'm not requesting for codes but I'm currently stuck at one section of my work and in need of an algorithm or way to overcome this.

    I need to read a file that contains records which I've already done by the simple File I/O functions but now I need to read at least 1.5million records in order to do data sorting. I'm unfamiliar with memory management as I think 1.5million records is huge. I've constructed a struct to store a record each and as a linked-list structure but taking the time to read/write and sort into consideration it'll take days to run it. I tried initializing array[1500000] to give a try but it seems to end up as a core dumped.

    Therefore, I'm requesting for an algorithm, a pseudo code or any ideas to overcome this. Codes would be welcomed but I'm not asking too much as I know it is wrong for someone to do someone else's work for free.

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Break the file into several parts with a size you can manage.
    Sort each one.
    Merge the sorted files.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Mar 2007
    Posts
    9
    Implementing a merge sort?

    I was thinking about implementing a quicksort seeing that it is the fastest sort. I'm wondering if there is a way to keep 1.5million records in memory during program runtime?

    I was thinking of an array of structs. How will I dynamically increase the array size as it approaches maximum capacity?

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    How big is your struct?
    How big is your array?
    You made it sound like it would not fit in memory.

    If you're saying you have the physical memory, then yes you can load the whole thing and sort it in one go.

    Did you perchance just create a 1M+ entry array on the stack and have it blow up?
    Because for such large numbers, you should be dynamically allocating the space.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by ryujinn View Post
    Implementing a merge sort?

    I was thinking about implementing a quicksort seeing that it is the fastest sort. I'm wondering if there is a way to keep 1.5million records in memory during program runtime?

    I was thinking of an array of structs. How will I dynamically increase the array size as it approaches maximum capacity?
    For speed, you don't want to dynamically re-allocate anything.

    For Quicksort:
    Test your system, and pick a conservatively safe array size. Use that as your array size. Now run your program & load your data into that array. Quicksort it, writing out the sorted results in dat1 file

    Then loop back, reload the array, and quicksort it, writing it out to dat2 file

    When fscanf tells you it's at the EOF, call your mergeIt() function to merge all the dat files back into one sorted file. Ideally, you always want two evenly sized files to merge together, not one GIANT file, and one very small file.

    In some tests with extremely large files, I discovered to my shock, that Windows2000 saves special resources for sorting. The best of 5 quicksort programs (including the C library), was beaten badly by calling the system with the sort command and the necessary file name.
    The OS handled the merging and all the rest, and easily 10X faster.

    I have no idea if WindowsXP or Vista has special resources like Windows2000 for sorting. If you try it, and it has it however, stand by to be amazed.

    Adak

  6. #6
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    For speed, you don't want to dynamically re-allocate anything.
    Actually, as discussed in a recent thread, reading from the disk is quite a bit slower in comparison to realloc().

    In some tests with extremely large files, I discovered to my shock, that Windows2000 saves special resources for sorting. The best of 5 quicksort programs (including the C library), was beaten badly by calling the system with the sort command and the necessary file name.
    The OS handled the merging and all the rest, and easily 10X faster.
    The sort program probably gets higher priority or something . . . .
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  7. #7
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by dwks View Post
    Actually, as discussed in a recent thread, reading from the disk is quite a bit slower in comparison to realloc().


    The sort program probably gets higher priority or something . . . .
    Obviously the data will have to be read from the disk, in either case. That's where the OP has the data, after all.

    What I meant was that reallocating an array can be quite a bit slower than just testing and using one array size that is allocated only once.

    I was hoping to find that info on the resources Windows2000 makes available for sorting, but I just don't use Windows2000 much anymore - no luck. It does mention the smart way it goes about merging any temporary files that may be needed, however.

    Perhaps as clever as the old Ben Baker Sort Utility for DOS. Although it had no special hardware resources, it was a software sorting powerhouse.

    'Dem was da days!

    Adak

  8. #8
    Registered User
    Join Date
    Mar 2007
    Posts
    9
    Well, I'm computing my work in SunOS.

    Break the file into several parts with a size you can manage.
    Sort each one.
    Merge the sorted files.
    Just a question on this algorithm. If I were to break approx 1.5million records. How am I suppose to merge them in the end?

    ie:
    Code:
    First Batch : unsorted
    3 1 6
    
    Sorted Batch :
    1 3 6
    
    Second Batch :  unsorted 
    8 2 4
    
    Sorted Second Batch :
    2 4 8
    If I were to merge them both in the end, correct me if I'm wrong, it would look like this
    1 3 6 2 4 8

    and considering the fact that I'm sorting a huge file roughly 40MB size, it just look unsorted at all.

  9. #9
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    you should look at the merge sort algorithm
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  10. #10
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    40MB shouldn't present any difficulty with doing it all in one step.
    Perhaps post your actual attempt so we can see how you allocated memory, and how you tried to sort it.

    On Merging
    You do this
    - read the first record from both files
    - while both files have records
    -- compare the two records, output which comes 'first' and read the next from the appropriate file
    - endwhile
    - append the file which still has records

    So given
    1 3 6
    2 4 8

    We read r1=1 and r2=2
    r1(=1) <= r2(=2), so output 1 to the output and read the next record (r1=3)
    r1(=3) > r2(=2), so output 2 to the output and read the next record (r2=4)
    And so on...
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  11. #11
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by ryujinn View Post
    Well, I'm computing my work in SunOS.

    Just a question on this algorithm. If I were to break approx 1.5million records. How am I suppose to merge them in the end?

    ie:
    Code:
    First Batch : unsorted
    3 1 6
    
    Sorted Batch :
    1 3 6
    
    Second Batch :  unsorted 
    8 2 4
    
    Sorted Second Batch :
    2 4 8
    If I were to merge them both in the end, correct me if I'm wrong, it would look like this
    1 3 6 2 4 8

    and considering the fact that I'm sorting a huge file roughly 40MB size, it just look unsorted at all.
    I've never worked with the SunOS.

    I'll guarantee the FIRST thing you want to do before anything else, is BACK UP your data files! Then you can take a snippet of it for something easy to work with.

    I'd say second, find out what your system and array size resources are. If you CAN allocate a large array on your system, then great - but if not, I'd sure try another way. Trying to suck a bull moose through a straw, just isn't going to be fun.

    Easiest and probably fastest way would be to use your OS, to sort, then it's easy breezy.

    If you can't or don't want to use your OS for the sort, then you have two choices - quicksort with multi-stage merges, or just straight merge sort. You'll want to read up a bit on Wikipedia regarding mergesort so you know what to expect, etc.

    After you decide on the algo you want to use, and have it working very small samples correctly, I'd try it on something bigger (maybe 10 megs total), and see how it performs.

    The merging works by your program either counting the merge files it is creating (at each level of the merge process), or looking in the directory, to see if there's any more that need to be merged. The actual merging works like a tournament does:

    1 HUGE file >> 100 merge files level 1 >> 50 merge files level 2 >> 24 merge files level 3, etc.
    Level 1 merge file1
    ----------------------------
    ===============Level 2 merge file1 (sorted merge of 1 & 2 on the left)
    Level 1 merge file2
    -----------------------------

    etc. Think "March Madness" NCAA Basketball tournament brackets.

    Other considerations:

    1) Quicksort is a "fragile" algorithm, and easily forced into running in a degraded mode. Take care that you're not trying to get it to sort on nearly already sorted keys, etc.

    2) Can you use this program enough to make it worthwhile investing some time and effort into creating it? If it's a one-time thing, and you don't want to learn about the programming skills that much, - just use a simple merging program design (or the OS).


    Adak

  12. #12
    Registered User
    Join Date
    Mar 2007
    Posts
    9
    Well, I've decided to use the OS's function of sorting which is the qsort(). But after implementing it, it doesn't seem to sort at all. Here's a snippet which is shortened alot :

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>
    
    #define MAXSIZE 100
    
    typedef struct dataTuple
    {
     char name[12];
     char server[30];
    }Tuple;
    
     
    int compare(const void * p1, const void *p2);
    
    int main(int argc, char *argv[])
    {
     
      Tuple *data[6]; /* I've decided to test it with a small value first */
      char line[100];
      int counter, length, i, j=0,k;
     
      /* does file reading and have an array of pointers pointing to struct*/
      /* Note : its rather a big chunk of code which I don't think its the problem
          as I've used a tokenizer so I removed it from here */
    
      for(k=0;k<=6;k++)
      {
        printf("Data : &#37;s\n", data[k]->server);
      }
    
      /* Using quicksort method and sort it by server*/
      qsort(data,6,sizeof(Tuple),compare);
    
      for(k=0;k<=6;k++)
      {
        printf("Data : %s\n", data[k]->server);
        free(data[k]);
      }
    
      return 0;
    }
    
    int compare(const void *p1, const void *p2)
    {
    
      const char *server1 = (const char *)(((Tuple *)p1)->server);
      const char *server2 = (const char *)(((Tuple *)p2)->server);
    
      return (strcmp(server1,server2));
    }
    I printed before sort and after sort, it doesn't seem to sort at all. I'm not sure what's the problem here.

    *EDIT*
    Thanks for the various input guys on the previous posts
    Last edited by ryujinn; 04-01-2007 at 07:04 AM.

  13. #13
    Registered User
    Join Date
    Mar 2005
    Posts
    20
    Quote Originally Posted by ryujinn View Post
    Well, I've decided to use the OS's function of sorting which is the qsort(). But after implementing it, it doesn't seem to sort at all. Here's a snippet which is shortened alot :

    Code:
    int compare(const void *p1, const void *p2)
    {
    
      const char *server1 = (const char *)(((Tuple *)p1)->server);
      const char *server2 = (const char *)(((Tuple *)p2)->server);
    
      return (strcmp(server1,server2));
    }
    Hi, I was just passing by and was looking at the above section of code. Was wondering if you last line of code for returning the value of string compare works. Maybe you should try,
    Code:
    if(strcmp(server1, server2)>0)
       return 1;
    else if(strcmp(server1, server2)<0)
       return -1;
    Not too sure, just a penny for thoughts.

  14. #14
    Registered User
    Join Date
    Mar 2007
    Posts
    9
    I've printed the return values of the strcmp. It seems to work as it gave negative and positive values but when I printed the strings that is being compared to. It printed junk chars which is unreadable. I'm not sure what's going on or why isn't it sorting at all.

    EDIT :
    I've altered the compare function to this
    Code:
    int compare(const void * p1, const void * p2)
    {
      const struct Tuple *ps1 = p1;
      const struct Tuple *ps2 = p2;
    
      int res;
    
      res =  strcmp(ps1->server, ps2->server); /*Error*/
      if(res != 0) /*If key value is not identical*/
         return 0;
      else /*If key value identical, use another key to sort*/
         return strcmp(ps1->name,ps2->name); /*Error*/
    }
    It seems to complain "dereferencing pointer to incomplete type"
    Last edited by ryujinn; 04-01-2007 at 09:11 AM.

  15. #15
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > for(k=0;k<=6;k++)
    1. This runs off the end of the array
    2. You don't show how you allocate it.

    > qsort(data,6,sizeof(Tuple),compare);
    Given an array
    T array[N]; // N elements of type T
    The call would be
    qsort( array, N, sizeof(T), compare);
    You have an array of pointers, yet the size you pass is the size of the whole structure.

    > const char *server1 = (const char *)(((Tuple *)p1)->server);
    Given that the array is of type T, you should be casting to a T* in this function.
    Don't try to cast and dereference all in the same line. It won't produce any less code and it will keep things simple for the reader.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. xor linked list
    By adramalech in forum C Programming
    Replies: 23
    Last Post: 10-14-2008, 10:13 AM
  2. gcc link external library
    By spank in forum C Programming
    Replies: 6
    Last Post: 08-08-2007, 03:44 PM
  3. Large file i/o
    By Mostly Harmless in forum C++ Programming
    Replies: 2
    Last Post: 07-18-2007, 01:48 PM
  4. File I/O problems!!! Help!!!
    By Unregistered in forum C Programming
    Replies: 4
    Last Post: 05-17-2002, 08:09 PM