Thread: Should I use threads to do the file io

  1. #1
    Registered User
    Join Date
    May 2009
    Posts
    72

    Should I use threads to do the file io

    Hi All

    I'm writing a tool that processes data. The amount of data can be huge, and it is spread over multiple files.
    So what I need to do is the following: 1) read a block of, for example, 1024 samples from each file, 2) process those 1024 samples from all files, and 3) repeat (go back to 1).
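As a single-threaded baseline, the read/process/repeat loop described above could be sketched roughly like this. The `float` sample type, the `process_round` placeholder (it only sums the samples) and the name `run_sequential` are assumptions for illustration; the real processing is not shown in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 1024  /* samples read from each file per round */

/* Placeholder for the real work: sums all samples read in one round. */
static double process_round(float *const *bufs, const size_t *counts, size_t nfiles)
{
    double sum = 0.0;
    for (size_t f = 0; f < nfiles; f++)
        for (size_t i = 0; i < counts[f]; i++)
            sum += bufs[f][i];
    return sum;
}

/* Step 1: read up to BLOCK samples from every file.
   Step 2: process them together.
   Step 3: repeat until every file is exhausted.
   Returns the number of rounds, or -1 if memory could not be allocated. */
long run_sequential(FILE **files, size_t nfiles, double *total)
{
    assert(files != NULL && total != NULL);
    float **bufs = calloc(nfiles, sizeof *bufs);
    size_t *counts = calloc(nfiles, sizeof *counts);
    long rounds = -1;

    if (bufs == NULL || counts == NULL)
        goto done;
    for (size_t f = 0; f < nfiles; f++)
        if ((bufs[f] = malloc(BLOCK * sizeof **bufs)) == NULL)
            goto done;

    rounds = 0;
    *total = 0.0;
    for (;;) {
        size_t got = 0;
        for (size_t f = 0; f < nfiles; f++) {
            counts[f] = fread(bufs[f], sizeof **bufs, BLOCK, files[f]);
            got += counts[f];
        }
        if (got == 0)           /* nothing left in any file */
            break;
        *total += process_round(bufs, counts, nfiles);
        rounds++;
    }

done:
    if (bufs != NULL)
        for (size_t f = 0; f < nfiles; f++)
            free(bufs[f]);
    free(bufs);
    free(counts);
    return rounds;
}
```

In this form every `fread` call stalls the processing, which is the waiting that a separate I/O thread is meant to hide.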

    To optimize this, I thought threads could help: the file I/O runs in one thread and the processing in another. The I/O thread makes sure there are always enough samples available, so the processing can continue without waiting.

    The questions I have are: when using pthreads, does blocking I/O in one thread block all the other threads? And should I use one thread per file, or deal with all files in a single thread?

    If you have other ideas or suggestions on how to optimize this, I would really like to hear them!

    cheers
    Luca

  2. #2
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    What have you achieved?
    "blocked waiting for I/O" is now "blocked waiting for a thread, waiting for I/O".

    Disk seek times are still measured in milliseconds (>1 ms), whereas your CPU clock period is <1 ns.
    That's a factor of >1E6 between them.
    Unless you're doing millions of calculations on each block, you're going to be I/O bound.

    Without doing any processing at all, time how long it takes just to read ALL the data.
    - Say this takes an hour
    Without doing any file reading, time how long it takes to process an average block 'n' times.
    - Say this takes 10 minutes

    Careful threading might cause a bit of overlap, so that the total time for both is 1:05 instead of 1:10.
    But the elephant in the room is physically reading the files.
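A rough sketch of the "time the reading on its own" measurement, assuming a POSIX system with `clock_gettime`. The function names and the 64 KiB chunk size are made up for illustration. The two small helpers spell out the arithmetic: with 60 minutes of reading and 10 minutes of processing, doing them one after the other costs 70 minutes, and even perfect overlap cannot get below 60.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Reads the whole stream in 64 KiB chunks without processing anything.
   Returns the elapsed wall-clock time in seconds and stores the number
   of bytes read in *bytes. */
double time_read_only(FILE *fp, size_t *bytes)
{
    static char chunk[64 * 1024];
    struct timespec t0, t1;
    size_t n;

    assert(fp != NULL && bytes != NULL);
    *bytes = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0)
        *bytes += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Total time when reading and processing run one after the other. */
double sequential_time(double io, double cpu)
{
    return io + cpu;
}

/* Best case with perfect overlap: the slower of the two still has to finish. */
double overlapped_lower_bound(double io, double cpu)
{
    return io > cpu ? io : cpu;
}
```

Note that a second run over the same files may be served from the operating system's cache and look much faster than the first, so time a cold read if that is what the real workload sees.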
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    I'm curious if you've tried running multiple instances of your program, with just one thread each?

    I recently did a Sudoku project that involved processing over a million files, each with tens of thousands of potential Sudoku puzzle grids. Despite using just one drive, I found that running 8 instances of the program was no problem, and it really increased throughput (with a smaller percentage gain from each additional instance). The drive was not a bottleneck, because I used a RAM drive for all the primary files. That cuts the read and write times down by a large factor.

    To keep it organized, I had each instance use its own directory on the RAM drive.

    Just make sure your program saves its work to the HD every 30 minutes or so.

  4. #4
    Registered User
    Join Date
    May 2009
    Posts
    72
    Well, I need to process all files simultaneously (I need to combine the data), so I cannot do this with multiple instances!

    But if I/O isn't blocking the processing thread, this will be the best solution (and probably not very hard to implement!)

    cheers

    UPDATE: One last thought: would reading the files one after the other take as much time as reading them all at the same time using threads?
    Last edited by jeanluca; 04-24-2010 at 12:03 PM.

  5. #5
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    I hope this combined data isn't spread over multiple threads.

  6. #6
    Registered User
    Join Date
    May 2009
    Posts
    72
    So you would suggest to create 1 thread which reads all files, and not a thread for each file, right ?

  7. #7
    spurious conceit MK27
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote: Originally posted by jeanluca
    So you would suggest to create 1 thread which reads all files, and not a thread for each file, right ?
    That's what I'd think. With I/O as the bottleneck, there is no point in multi-threading the various reads -- even if you have multiple processors, you are still dealing with one hard drive with one read head in it. So it cannot happen concurrently (otherwise you could read all of a file of any length instantaneously).

    If you have one "producer" thread reading the files and queuing the data, and the processing takes longer than the reading, the queue will grow, in which case you could use multiple threads for the processing. However, that is very unlikely unless you are taking a relatively small file and doing lots of intense work with it. So most likely there will also be just one "consumer" thread, which spends most of its time waiting for more data.

    That's two threads. If more are beneficial, they will be beneficial on the data processing side, not the file reading side. There cannot be a purpose to multiple threads just reading files off one device.
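A minimal sketch of that two-thread layout with pthreads, under some stated assumptions: samples are `float`, the queue is a fixed ring of 8 blocks, the "processing" is just a sum, and all names (`run_pipeline`, `struct queue`, and so on) are invented for illustration. One producer thread does all the file reads; the calling thread acts as the consumer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 1024  /* samples per read, as in the original post */
#define SLOTS 8     /* queue capacity in blocks (arbitrary choice) */

struct block { size_t file, count; float samples[BLOCK]; };

struct queue {
    struct block slot[SLOTS];
    size_t head, tail, used;
    int done;                   /* set once every file has hit EOF */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

struct job { FILE **files; size_t nfiles; struct queue *q; double sum; };

/* Producer: the only thread that touches the files. */
static void *producer(void *arg)
{
    struct job *job = arg;
    struct queue *q = job->q;
    float buf[BLOCK];
    size_t got;
    do {
        got = 0;
        for (size_t f = 0; f < job->nfiles; f++) {
            /* The blocking read happens outside the lock. */
            size_t n = fread(buf, sizeof buf[0], BLOCK, job->files[f]);
            if (n == 0)
                continue;
            got += n;
            pthread_mutex_lock(&q->lock);
            while (q->used == SLOTS)        /* queue full: wait for the consumer */
                pthread_cond_wait(&q->not_full, &q->lock);
            struct block *b = &q->slot[q->tail];
            b->file = f;
            b->count = n;
            memcpy(b->samples, buf, n * sizeof buf[0]);
            q->tail = (q->tail + 1) % SLOTS;
            q->used++;
            pthread_cond_signal(&q->not_empty);
            pthread_mutex_unlock(&q->lock);
        }
    } while (got > 0);
    pthread_mutex_lock(&q->lock);
    q->done = 1;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Consumer: placeholder processing that sums every sample. */
static void *consumer(void *arg)
{
    struct job *job = arg;
    struct queue *q = job->q;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->used == 0 && !q->done)    /* queue empty: wait for data */
            pthread_cond_wait(&q->not_empty, &q->lock);
        if (q->used == 0) {                 /* empty and producer finished */
            pthread_mutex_unlock(&q->lock);
            break;
        }
        struct block b = q->slot[q->head];
        q->head = (q->head + 1) % SLOTS;
        q->used--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        for (size_t i = 0; i < b.count; i++)
            job->sum += b.samples[i];
    }
    return NULL;
}

/* Returns the sum of all samples, or -1.0 if the producer could not be started. */
double run_pipeline(FILE **files, size_t nfiles)
{
    struct queue q;
    struct job job = { files, nfiles, &q, 0.0 };
    pthread_t prod;
    assert(files != NULL);
    memset(&q, 0, sizeof q);
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);
    int started = pthread_create(&prod, NULL, producer, &job) == 0;
    if (started) {
        consumer(&job);         /* the calling thread acts as the consumer */
        pthread_join(prod, NULL);
    }
    pthread_cond_destroy(&q.not_full);
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.lock);
    return started ? job.sum : -1.0;
}
```

Blocks arrive in round-robin file order, and the `file` field records which file each one came from, so a real consumer could collect one block per file before combining them. If processing ever became the bottleneck, more consumer threads could pull from the same queue, though a shared result like `sum` would then need its own protection.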
    Last edited by MK27; 04-24-2010 at 01:02 PM.

  8. #8
    Registered User
    Join Date
    May 2009
    Posts
    72
    That's what I will do: a producer and a consumer thread!

    thnx a lot
    Luca

    Last Post: 12-14-2004, 10:18 AM