Thread: How to read a flat file in parallel in C?

  1. #1
    Registered User
    Join Date
    Feb 2006
    Posts
    22

    How to read a flat file in parallel in C?

    Hi All,
    I want to write a C/Pro*C program to read a flat file in parallel. Does anyone have any idea about this, or has anyone by any luck implemented it? Does any built-in function for this exist in C?

    Any ideas or thoughts would be appreciated.



    Thanks in advance.

  2. #2
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    fgets + sscanf = read and scan a line at a time.


    Quzah.
    Hope is the first step on the road to disappointment.

  3. #3
    Registered User
    Join Date
    Feb 2006
    Posts
    22
    Thanks for the quick response. Sorry, I didn't get what exactly you are trying to say.

    My requirement is: I have a flat file, and I have to read that file in parallel (split up), do some checking, and then merge it again. The output file will be the cleansed file.


    Anwar

  4. #4
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Quzah answered your question: with the standard C functions fgets() and sscanf() you can read a line at a time from the input file, parse it, and then write it out again (with fputs()).

    Something along the lines of this:
    Code:
    /* assumes: char buffer[256], somemoredata[256]; int somedata;
       FILE *fp (input) and *out (output) already opened */
    while(fgets(buffer, sizeof(buffer), fp)) {
        if(sscanf(buffer, "%i: %255s", &somedata, somemoredata) != 2) {
            puts("Malformed line");
        }
        else {
            fputs(buffer, out);  /* copy the good line to the output */
        }
    }
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > my requirement is i have Flat file and i have to read that file parallel(split up)
    If by that you mean the file is in several parts (actually separate files), then just open up each one for reading, use fgets() to read the first record from each one, then examine all the records (one from each file) to decide what to do next.

    When you've processed a record, replace it with the next record from the corresponding file.
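
    A minimal sketch of that merge loop, assuming the parts are plain text files, a record is a line, and "decide what to do next" means emitting records in sorted order (the filenames are made up):
    Code:
    #include <stdio.h>
    #include <string.h>

    #define NFILES 3
    #define LINELEN 256

    int main(void)
    {
        const char *names[NFILES] = {"part1.txt", "part2.txt", "part3.txt"};
        FILE *fp[NFILES];
        char rec[NFILES][LINELEN];
        int live[NFILES];   /* 1 while that file still has a pending record */
        int i;

        for(i = 0; i < NFILES; i++) {
            fp[i] = fopen(names[i], "r");
            live[i] = fp[i] != NULL && fgets(rec[i], LINELEN, fp[i]) != NULL;
        }
        for(;;) {
            int best = -1;
            for(i = 0; i < NFILES; i++)  /* examine one record from each file */
                if(live[i] && (best < 0 || strcmp(rec[i], rec[best]) < 0))
                    best = i;
            if(best < 0)
                break;                   /* every file is exhausted */
            fputs(rec[best], stdout);    /* "process" the chosen record */
            /* replace it with the next record from the same file */
            live[best] = fgets(rec[best], LINELEN, fp[best]) != NULL;
        }
        for(i = 0; i < NFILES; i++)
            if(fp[i])
                fclose(fp[i]);
        return 0;
    }
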
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    Registered User
    Join Date
    Feb 2006
    Posts
    22
    I mean I have only one file, and to improve performance I have to read it in parallel, write into multiple files in parallel, perform some operations, and then merge it again.

    Is it achievable with fgets() + sscanf()?

  7. #7
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by anwar_pat
    I mean I have only one file, and to improve performance I have to read it in parallel, write into multiple files in parallel, perform some operations, and then merge it again.

    Is it achievable with fgets() + sscanf()?
    I understand you to mean that you have one very large file, and to "process it quickly", you want to open that file, say, three times, adjust the file position indicators to, say, 1/3rd of the length of the file, and then process each of the three data streams from the one file in parallel, ending up with one large file at the end.
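
    For concreteness, that scheme would look roughly like this (a sketch only; the filename is made up, and the offsets would still need snapping to record boundaries):
    Code:
    #include <stdio.h>

    int main(void)
    {
        FILE *a = fopen("big.dat", "rb");  /* the same file, opened three times */
        FILE *b = fopen("big.dat", "rb");
        FILE *c = fopen("big.dat", "rb");
        long len;

        if(!a || !b || !c)
            return 1;
        fseek(a, 0L, SEEK_END);
        len = ftell(a);                      /* total file length */
        rewind(a);
        fseek(b, len / 3, SEEK_SET);         /* second stream: 1/3 of the way in */
        fseek(c, 2 * (len / 3), SEEK_SET);   /* third stream: 2/3 of the way in */
        /* ...interleave reads from a, b and c here... */
        fclose(a); fclose(b); fclose(c);
        return 0;
    }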

    Here's why I think that's a bad idea:

    1) Every time you change to the next data stream, your L1 and L2 cache (which is your REALLY fast memory) will be completely flushed out and have to be re-filled from the file buffer. Big slowdown here. It's like instead of reaching over to the desktop and getting a letter from there, you had to walk out to the mailbox at the curb to get that letter.

    2) Every time you change the data stream, you will have to rebuild the data in the drive controller's cache memory. These caches are sizeable (8 to 16 megs of RAM) and pre-load the next data in the file you're currently accessing, in anticipation of you wanting that data next.

    Losing that pre-loaded data is like taking a trip to the post office to get that letter, in our letter analogy. You REALLY don't want that, imo.

    3) You'd be doing #1 and #2 above repeatedly, while the "parallel" program was running.

    Here's another idea: if you need to work with the data very quickly, and REALLY need to do it in parallel, use a file splitter program like "HJSplit" (free) to split the original file up into splits that make sense. Then open as many of the split files as you wish at the same time, and see if you can load the data into huge arrays, ONE AFTER THE OTHER (not interleaved), and then work with your data and write it out again.

    You maximize your drive buffer and your cache hits, and you're able to work with the data all in memory (or at least a good chunk of it; the "chunks" in the arrays may need to be re-loaded several times due to memory constraints). Last time I did something like this, I had to keep the number of records I loaded down to 2,000 at a time.
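
    In outline, something like this (a sketch; the filename, record length, and process_batch() are placeholders):
    Code:
    #include <stdio.h>

    #define CHUNK 2000   /* records per batch, as above */
    #define RECLEN 128

    static char recs[CHUNK][RECLEN];

    static void process_batch(int n)
    {
        /* sort/check/write the n records in recs[] here */
    }

    int main(void)
    {
        FILE *fp = fopen("split1.dat", "r");
        int n;

        if(!fp)
            return 1;
        do {
            /* fill the array with up to CHUNK records, one after the other */
            for(n = 0; n < CHUNK && fgets(recs[n], RECLEN, fp); n++)
                ;
            if(n > 0)
                process_batch(n);
        } while(n == CHUNK);   /* a short batch means we hit EOF */
        fclose(fp);
        return 0;
    }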

    Be careful NOT to invoke "virtual memory" from your system's hard drive by asking for too much data at any one time. Virtual memory is just using the hard disk, and it will be a HUGE slowdown, in every way.

    Hope that helps.

    Adak

  8. #8
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    Quote Originally Posted by Adak
    I understand you to mean that you have one very large file, and to "process it quickly", you want to open that file, say, three times, adjust the file position indicators to, say, 1/3rd of the length of the file, and then process each of the three data streams from the one file in parallel, ending up with one large file at the end.

    Given the simplicity of this task, and given the speeds of computers these days, none of what you said is going to make any difference unless the file size is at least 30 to 40% of the entire hard drive.

    Quote Originally Posted by Adak
    1) Every time you change to the next data stream, your L1 and L2 cache (which is your REALLY fast memory) will be completely flushed out and have to be re-filled from the file buffer. Big slowdown here.
    Ok, and why is that, since the L1 and L2 caches contain the most recently and frequently used instructions? I highly doubt a stream switch completely flushes the cache.

    Quote Originally Posted by Adak
    2) Every time you change the data stream, you will have to rebuild the data in the drive controller's cache memory. These caches are sizeable (8 to 16 megs of RAM) and pre-load the next data in the file you're currently accessing, in anticipation of you wanting that data next.
    It does not load 8 to 16 megs of RAM each time you switch streams. What if the file isn't 8 to 16 megs at all? The drive controller cache is used for high-speed data bursts using read-ahead - and that's provided it is enabled in Windows.
    Just because you switch files does not mean the drive completely resets the 8 to 16 megs of RAM in the controller. That would be horribly inefficient, and I suspect they have a much better approach. My wild guess would be that they simply increment a pointer in the memory and then write to this area; once you arrive at the end of the memory, they wrap it and overwrite any previous data. Overwriting previous data is much faster than clearing the memory.

    If the controller cleared its memory each time, Windows would crawl to a stop. At any one time there are probably 15 to 20 processes all asking for disk access from different files on different parts of drives and/or multiple drives. If the cache were cleared each time a process needed to access another file or stream of data, it would be a slideshow.
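
    To illustrate the wrap-and-overwrite idea in code (a toy model only; what the controller firmware actually does isn't visible from C):
    Code:
    #define CACHE_SIZE 16

    static char cache[CACHE_SIZE];
    static int head = 0;

    /* new data overwrites the oldest bytes; nothing is ever "cleared" */
    void cache_write(char byte)
    {
        cache[head] = byte;
        head = (head + 1) % CACHE_SIZE;  /* wrap at the end of the buffer */
    }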

    Salem posted a link some time ago to a utility that will monitor all disk access on your system. Just from looking at the output of that program, there is no way what you say is true. The main OS drive is in a constant state of flux. Likewise, my scans with Diskkeeper 10 also confirm this, since the main OS drive gets completely fragmented in less than 24 hours of operation. And we are also forgetting about the paging system inherent in protected-mode programming, which swaps pages of memory out of physical RAM onto the drive - again, by your definition, this would be a cache flush and a reset of the drive controller RAM. I hardly think so.




    Just use what quzah suggested. No tricks here, just good old file I/O.
    Last edited by VirtualAce; 09-15-2006 at 03:53 AM.

  9. #9
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Obviously, I've tried to describe something in simpler terms than how it actually works.

    What you're apparently understanding as my meaning is not what I meant, and what I meant is a gross simplification.

    But the general idea is correct.

    You take the simple file i/o, and I'll take my way, "and I'll be in Scotland afore ye." :P

    Adak

  10. #10
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    You just keep telling yourself that. Are you making this multi-threaded? Somehow I doubt it. If you aren't, everything is going to be handled sequentially anyway. If you are, my understanding of C threads is that they're really just faked threading anyway: most just pretend to be multi-threaded by giving a "time slice" to each action and skipping through the list of "threads", doing their "time slices".

    In either case, unless you have true multi-threading, you're still going to end up with the same number of actions taking the same amount of time. Why? Because I can make three calls to fread() one after another, just using one file pointer, where you make three calls to three different spots.

    Hell, mine will probably end up being faster, because I don't have to worry about my first read running over the top of something I've already read. I have no checks other than one EOF check, where you have to calculate the size of the file, split it into N chunks, and for each read, make sure you don't run over the start point of another segment.
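
    In other words, something like this (the chunk size and filename are only illustrative):
    Code:
    #include <stdio.h>

    #define CHUNK 4096

    int main(void)
    {
        FILE *fp = fopen("big.dat", "rb");
        char buf[CHUNK];
        size_t got;

        if(!fp)
            return 1;
        /* sequential reads through one file pointer; the fread()
           return value is the only end-of-file check needed */
        while((got = fread(buf, 1, sizeof buf, fp)) > 0) {
            /* process got bytes of buf here */
        }
        fclose(fp);
        return 0;
    }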


    Quzah.
    Hope is the first step on the road to disappointment.

  11. #11
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Maybe, but
    The purpose of using the POSIX thread library in your software is to execute software faster.
    From http://yolinux.com/TUTORIALS/LinuxTu...ixThreads.html

    I was always under the impression that multithreaded programs might be the same speed or a little bit slower (because of the time slicing involved), but they might be much faster if your computer has multiple CPUs or if a thread is limited to an amount of CPU time.
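
    A minimal POSIX threads sketch (compile with -pthread; the worker function is just a placeholder):
    Code:
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        printf("worker %d running\n", *(int *)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;

        pthread_create(&t1, NULL, worker, &id1);  /* both workers may run */
        pthread_create(&t2, NULL, worker, &id2);  /* concurrently */
        pthread_join(t1, NULL);                   /* wait for both to finish */
        pthread_join(t2, NULL);
        return 0;
    }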

    [edit] Maybe soon, from http://www.nus.edu.sg/comcen/svu/pub...c-trends.html:
    An interesting true multithreaded machine on the horizon is the Tera, which is currently under test runs at the University of California, San Diego (http://www.tera.com/). A thread, in essence, acts like a mini-processor. As a program runs, it is broken down into individual tasks, and each task is executed by a different thread. Each of the Tera's physical processors can handle up to 128 threads simultaneously, switching automatically from one thread, which is waiting for data to return from memory, to another thread that is ready to execute. In effect, the threads keep each processor busy almost all the time. Encouraging performance results have been published for both scientific and commercial applications.
    [/edit]
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  12. #12
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    I'm sure we have been derailed here.

    POSIX threads are a definite speed-up, but not mentioned by the OP, so not a part of this.

    My point was that a simple file i/o algorithm, with several files in parallel, was inefficient:

    e.g.:

    input file #1: data item
    input file #2: data item
    input file #3: data item
    input file #4: data item
    etc.

    Code here to do something with the data items
    Now more input

    input file #1: data item
    input file #2: data item
    input file #3: data item
    etc.

    My suggestion was to read as much as possible from as few files as practical, and then process that data. If splitting the file up helped in that effort, fine; otherwise, don't do it.

    Then return to reading another HUGE batch of data from as few files as possible (hopefully putting it into an array if practical), and then process that data.

    I have some experience with this kind of programming, and in that experience I learned the way to do this was to input AS MANY fields or records or data items as you could from as few files as possible, and process them. I used arrays, quicksort, and binary searches in my program.
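
    Skeletally, that pattern looks like this (the record format, filename, and lookup key are all made up):
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXREC 2000
    #define RECLEN 64

    static char recs[MAXREC][RECLEN];

    static int cmp(const void *a, const void *b)
    {
        return strcmp((const char *)a, (const char *)b);
    }

    int main(void)
    {
        FILE *fp = fopen("batch.dat", "r");
        const char *key = "some record\n";   /* fgets() keeps the newline */
        int n = 0;

        if(!fp)
            return 1;
        while(n < MAXREC && fgets(recs[n], RECLEN, fp))   /* load a batch */
            n++;
        fclose(fp);
        qsort(recs, n, RECLEN, cmp);                      /* sort the batch */
        if(bsearch(key, recs, n, RECLEN, cmp))            /* binary search */
            puts("found");
        return 0;
    }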

    In a little less than 30 minutes, three databases would be cross-checked for accuracy, and updated as well. That was a HUGE time savings.

    Adak
