Thread: Splitting up a bunch of lines for processing

  1. #1
    Old Fashioned
    Join Date
    Nov 2016
    Posts
    137

    Splitting up a bunch of lines for processing

    I want to write a program which will open up a large plaintext file with many thousands of lines, split that file into n number of chunks, and then pass each chunk to a different thread for processing (searching). I envision the workload like this:

    1.) Accept input file name from user
    2.) Attempt to open the file (exit on fail)
    3.) Get the total line count (or total byte count)
    4.) Split the above into even sections (I'm guessing I may have to round up/pad in order to split the file evenly) using separate heap allocations (malloc)
    5.) Pass 2 args to each thread: a) the search term and b) a pointer to that thread's chunk of memory to search
    6.) Each thread runs the same search function on separate data and prints the results with line numbers. I'm debating whether this should be a line-by-line search, or whether only the positions of the line breaks should be tracked while the actual search runs on a larger chunk of data (a rough skeleton of the whole flow is sketched just after this list)
    7.) The program exits
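
    Roughly, the skeleton I'm picturing looks like this. It's an untested sketch with most of the error handling trimmed; for simplicity it reads the whole file into one buffer and hands each thread an offset/length rather than doing separate mallocs, and search_chunk is just a placeholder for the real search:

    Code:
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_THREADS 6

    /* Argument handed to each worker thread: the search term plus
       that thread's slice of the file buffer. */
    struct chunk_arg {
        const char *term;   /* search term */
        const char *start;  /* first byte of this thread's chunk */
        size_t      len;    /* number of bytes in this chunk */
    };

    /* Placeholder worker: each thread runs the same search over its
       own chunk and prints matches with line numbers. */
    static void *search_chunk(void *p)
    {
        struct chunk_arg *a = p;
        /* ... scan a->start .. a->start + a->len for a->term ... */
        (void)a;
        return NULL;
    }

    int main(int argc, char **argv)
    {
        if (argc < 3) {                      /* step 1: get file name/term */
            fprintf(stderr, "usage: %s <file> <term>\n", argv[0]);
            return 1;
        }

        FILE *f = fopen(argv[1], "rb");      /* step 2: open, exit on fail */
        if (!f) { perror("fopen"); return 1; }

        fseek(f, 0, SEEK_END);               /* step 3: total byte count */
        long size = ftell(f);
        rewind(f);

        char *buf = malloc((size_t)size);
        if (!buf || fread(buf, 1, (size_t)size, f) != (size_t)size) {
            fprintf(stderr, "read failed\n");
            return 1;
        }
        fclose(f);

        pthread_t tid[NUM_THREADS];
        struct chunk_arg arg[NUM_THREADS];
        size_t base = (size_t)size / NUM_THREADS, off = 0;

        /* steps 4-5: split the buffer and hand each thread its piece;
           the last thread simply absorbs the leftover bytes */
        for (int i = 0; i < NUM_THREADS; i++) {
            size_t len = (i == NUM_THREADS - 1) ? (size_t)size - off : base;
            arg[i] = (struct chunk_arg){ argv[2], buf + off, len };
            off += len;
            pthread_create(&tid[i], NULL, search_chunk, &arg[i]);
        }

        /* step 6: wait for every searcher to finish */
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(tid[i], NULL);

        free(buf);
        return 0;                            /* step 7 */
    }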

    So in summary, I have 2 areas of concern:

    1.) Splitting up the workload properly so that the search gets 100% coverage. For example, let's say there are 4184737 bytes in the file to be searched and there are 6 threads. 4184737 does not divide evenly by 6, so the split seems like it will be misaligned and/or we may basically miss a search term. To accommodate this, I was thinking of padding the file with 0x00 bytes up to the nearest size divisible by 6, for example.

    2.) The chunk size that each thread should run a single search call on. I probably will not use strstr, but as an example, the question is whether we call strstr on a string whose strlen is 80 or on a 4000 byte chunk. My inclination is to only operate on lines (split on \n, basically) so that the line numbers can easily be tracked, but this may be slower (one idea for handling both concerns at once is sketched just below this list).
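
    That idea: instead of 0x00 padding, take a naive byte offset for each boundary and nudge it forward to just past the next '\n'. Then every chunk holds only whole lines, nothing gets cut in half between two threads, and the last chunk absorbs the remainder, so no padding is needed. Untested sketch, and the helper names are only placeholders:

    Code:
    #include <stddef.h>
    #include <string.h>

    /* Advance a naive split point to just past the next '\n' so no
       line is ever cut in half between two threads.  Returns len if
       there is no later newline. */
    static size_t snap_to_line(const char *buf, size_t len, size_t pos)
    {
        if (pos >= len)
            return len;
        const char *nl = memchr(buf + pos, '\n', len - pos);
        return nl ? (size_t)(nl - buf) + 1 : len;
    }

    /* Fill start[0..nthreads] so that chunk i covers
       buf[start[i]] .. buf[start[i+1] - 1].  A chunk can come out
       empty if one line spans several naive boundaries, which is
       harmless for searching. */
    static void split_chunks(const char *buf, size_t len,
                             size_t start[], int nthreads)
    {
        start[0] = 0;
        for (int i = 1; i < nthreads; i++)
            start[i] = snap_to_line(buf, len, (size_t)i * (len / nthreads));
        start[nthreads] = len;
    }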

    Do any of you folks with more concurrent programming experience have any recommendations here? The idea is that this program will search much faster because instead of having 1 thread search a large file top to bottom, we will have 6 threads (for example) split the file up and search it at the same time.
    If I was homeless and jobless, I would take my laptop to a wifi source and write C for fun all day. It's the same thing I enjoy now!

  2. #2
    Registered User
    Join Date
    Dec 2017
    Posts
    1,626
    It would be slower to do it that way than to just search the file in one piece (with one thread).
    A little inaccuracy saves tons of explanation. - H.H. Munro

  3. #3
    Registered User
    Join Date
    May 2019
    Posts
    214
    I offer a slight expansion on @john.c's point.

    Searching for a text match could be an unknown workload unless it is a simple match. If the file is ASCII only, you may bypass localization issues. If the search is a regular expression, that has its own "weight" to it. A simple match of a small string, however, is so cheap that it amounts to a basic CPU operation.

    Put another way, the search "comparison" carries a performance cost of its own, and if that cost happens to be extremely heavy, you might get some gain in overall time to complete.

    However, in your example you show around 4 Mbytes. On modern hardware this is a small file; it can be loaded entirely into RAM on most machines. Text files are often smaller than that, while some formatted data files (like XML) reach gigabytes in size.

    Most hard disks (the rotating kind) sustain around 150-190 Mbytes / sec under best-case conditions, with very large penalties for access interruptions (skipping around for data that isn't laid out sequentially on the drive). Scattered (fragmented) data files, which are common, can drop that 150 Mbytes / sec best case down to 5 Mbytes / sec or lower.

    SSD drives, however, commonly sustain over 300 Mbytes / sec (with some exceptions downward), with many over 500 Mbytes / sec.

    Compared to any search algorithm these speeds are slow, but some of the styles for reading data from files impose overhead.

    RAM speeds hover around 20 GBytes / sec on modern hardware (anywhere from 15 to 60 GBytes / sec), which is one significant part of the search problem. Where RAM becomes the primary bottleneck, keep in mind it is a resource shared by all threads.

    Let me put that another way.

    If your comparison is quick, and all threads can rip through RAM at near the maximum speed of RAM, it won't matter how many threads are running - they're all saturating the same, singular resource which can't be multiplexed in common hardware.

    What's worse, threading can cause "thrashing" among CPU cores. This is where multiple cores compete for what is a single source of RAM, each demanding different addresses from it. Each of those address requests causes significant delays compared to reading sequentially through RAM. Where the comparison is fast, and the CPUs are therefore already saturating memory speed, this can actually make some threaded algorithms run slower than a single-threaded implementation of the same basic function.

    The exception is where the comparison is very heavy. This means the CPU must work hard on a small task from a chunk of memory loaded into cache (a limited resource), such that the algorithm can't saturate the RAM's potential for providing data. In this case threading can be quite beneficial. The key is the balance between CPU workload and the speed at which the algorithm consumes RAM.

    Engineers (programmers are engineers, or should aspire to act like one) would experiment and measure. There can be an "exact balance" point between RAM throughput and algorithmic complexity (like a heavy comparison operation) which changes the dynamics regarding whether threading actually helps.

    An example of this is threaded, 3 pivot quicksort. Most "examples" sort by integers, about the fastest comparison a CPU can do. A 3 pivot quicksort divides the sorted data into 4 sections at each pass (the original just divided into 2).

    Bottom line, quicksort is heavy on RAM throughput. It does "bounce around" the data badly, getting worse as the algorithm proceeds. Threading is one level of performance concern, and the 3 pivot method is another. The benefit to the 3 pivot is that it finishes with far fewer overall comparisons than the standard quicksort. The downside is that the algorithm isn't as cache friendly.

    Here, again, it all depends on the comparison. Where the comparison is an integer, the threaded 3 pivot quicksort is actually slower than standard quicksort. Where the comparison is complex, like a language independent UTF-8 string sorted by various languages (from French to Japanese), the threaded 3 pivot quicksort gains power because it reduces the comparisons (which are the heavier part).

    This is part of the engineering that goes into these kinds of decisions.

    You should complete the experiment and compare. That's the engineering habit.
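
    For what it's worth, the measurement doesn't have to be elaborate. A rough sketch along these lines (POSIX threads; counting newlines stands in for a cheap comparison, and the names are just placeholders) times the same scan with 1 through 6 threads so you can watch where the curve stops improving:

    Code:
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    struct slice { const char *p; size_t n; long hits; };

    /* Cheap "comparison": just count newlines in this slice. */
    static void *count_newlines(void *v)
    {
        struct slice *s = v;
        const char *p = s->p, *end = s->p + s->n;
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            s->hits++;
            p++;
        }
        return NULL;
    }

    /* Scan the whole buffer with nthreads threads, return elapsed seconds. */
    static double run_with_threads(const char *buf, size_t len, int nthreads)
    {
        pthread_t tid[64];
        struct slice sl[64];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);

        size_t base = len / nthreads, off = 0;
        for (int i = 0; i < nthreads; i++) {
            size_t n = (i == nthreads - 1) ? len - off : base;
            sl[i] = (struct slice){ buf + off, n, 0 };
            off += n;
            pthread_create(&tid[i], NULL, count_newlines, &sl[i]);
        }
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        size_t len = 256u * 1024 * 1024;   /* 256 MB of synthetic "lines" */
        char *buf = malloc(len);
        if (!buf) return 1;
        memset(buf, 'x', len);
        for (size_t i = 0; i < len; i += 80)
            buf[i] = '\n';

        for (int n = 1; n <= 6; n++)
            printf("%d thread(s): %.3f s\n", n, run_with_threads(buf, len, n));

        free(buf);
        return 0;
    }

    If the cheap scan stops speeding up after two or three threads, you are looking at the RAM ceiling I described; put a genuinely heavy comparison in place of count_newlines and the scaling should come back.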

  4. #4
    Old Fashioned
    Join Date
    Nov 2016
    Posts
    137
    Wow, Niccolo thanks for getting my thinking going. I totally understand your experimentation suggestion, and I love experimenting. In fact, this entire project is basically an experiment. But it's nice to hear ideas from other more experienced folks to get my head thinking.

    One of my favorite things about C is that it can get me into thinking about the hardware that the code is actually running on.

    By the way, Niccolo, do you have any good reference material for algorithms and their effect on actual hardware like you described? Most algorithmic publications seem to be totally theoretical and disregard actual hardware problems faced by practicing programmers.

    I like the memory/CPU discussion you sparked here. I forgot to consider the speed of the memory itself. In fact, it seems the cache wouldn't help much here either, since each thread would be going through entirely unique data just once before quitting. What I'm getting from this is that once I reach a certain sweet spot, a faster machine may destroy a slower machine due to things like RAM speed and SSD throughput.

    Funny you bring this up, because today I'm completing a build with 64GB of 3600MHz RAM, a 12-core Ryzen 9 3900X, an NVMe PCIe 4.0 SSD, and an 8GB RX 580 graphics card. It will be interesting to experiment with that vs my laptop.
    If I was homeless and jobless, I would take my laptop to a wifi source and write C for fun all day. It's the same thing I enjoy now!

  5. #5
    Registered User
    Join Date
    May 2019
    Posts
    214
    @Asymptotic, glad you like that

    "One of my favorite things about C is that it can get me into thinking about the hardware that the code is actually running on."

    Indeed, C was originally created to rewrite the UNIX operating system, translating the original assembly implementation into a chip-independent, assembler-like language, which became C.

    "By the way, Niccolo, do you have any good reference material for algorithms and their effect on actual hardware like you described?"

    Ouch. I have to point toward white papers and such. I don't know that this kind of thing is much discussed in books, and when it is, it's usually part of something like a discussion of real-time physics engine design. I've been an engineer for decades, and most of what I know about these points comes from white papers, a few blogs, and my own experiments. I spent maybe three months focused on variations of quicksort (introsort) using a policy-based implementation in C++, so I could substitute variations on the fly and experiment carefully with 2 pivot and 3 pivot versions of quicksort, variations of the tail phase (insert/merge/etc.), and threading for each of them.

    Basically, some books on this are probably overdue. However, engineers are consumers of science, and scientific papers on the subject are a rich, dense, sometimes cryptic mix of theory and experiment, a base of knowledge that feeds the more practical experience of making software.


    "I'm completing a build...."

    Nice machine! In the mid 90's, that would probably represent $20 million in hardware or more. It's a great time to be in this industry!

  6. #6
    Old Fashioned
    Join Date
    Nov 2016
    Posts
    137
    Oh, by the way, I forgot to mention... Although my example was around a few megabytes, I'd actually be using this program to handle 4-10GB files and up to 40GB at a time. I definitely would not spin up that many threads just to search through 4MB on most modern machines.
    If I was homeless and jobless, I would take my laptop to a wifi source and write C for fun all day. It's the same thing I enjoy now!

  7. #7
    Old Fashioned
    Join Date
    Nov 2016
    Posts
    137
    Niccolo,

    Thought you might find this article interesting: ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} - Andrew Gallant's Blog

    These are the types of challenges I had been thinking of. My search program is much simpler because I don't need to support regular expressions, and it turns out strstr() is SUPER FAST. I was able to run it on 3 million lines of text in about 0.16 seconds and print the results.
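
    The core of mine is nothing fancy; it's roughly this shape (a simplified sketch rather than my exact code):

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Simplified sketch of the per-chunk search: walk the chunk line
       by line, let strstr do the matching, and print the line number
       of each hit.  first_line is the number of the chunk's first line. */
    static void search_lines(const char *chunk, size_t len,
                             const char *term, long first_line)
    {
        long lineno = first_line;
        const char *p = chunk, *end = chunk + len;

        while (p < end) {
            const char *nl = memchr(p, '\n', (size_t)(end - p));
            size_t linelen = nl ? (size_t)(nl - p) : (size_t)(end - p);

            char *line = malloc(linelen + 1);   /* NUL-terminate for strstr */
            if (!line)
                return;
            memcpy(line, p, linelen);
            line[linelen] = '\0';

            if (strstr(line, term))
                printf("%ld: %s\n", lineno, line);

            free(line);
            lineno++;
            p = nl ? nl + 1 : end;
        }
    }

    int main(void)
    {
        /* tiny driver just to show the shape */
        const char text[] = "alpha\nbravo term here\ncharlie\n";
        search_lines(text, sizeof text - 1, "term", 1);
        return 0;
    }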

    Obviously, full grep-style support is an entirely different problem, but it's still an interesting group of solutions.
    If I was homeless and jobless, I would take my laptop to a wifi source and write C for fun all day. It's the same thing I enjoy now!
