I want to write a program that will open a large plaintext file with many thousands of lines, split that file into n chunks, and then pass each chunk to a different thread for processing (searching). I envision the workload like this:

1.) Accept input file name from user
2.) Attempt to open the file (exit on fail)
3.) Get the total line count (or total byte count)
4.) Split the file into even sections (I'm guessing I may have to round up/pad in order to split it evenly), each in its own heap allocation (malloc)
5.) Pass two arguments to each thread: a) the search term and b) a pointer to that thread's chunk of memory to search (a rough skeleton of this is sketched after the list)
6.) Each thread runs the same search function on its own data and prints the results with line numbers. I'm debating whether this should be a line-by-line search, or whether only the positions of the line breaks should be tracked while the actual search runs over a larger chunk of data
7.) The program exits

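To make the question concrete, here is a rough, untested skeleton of how I picture steps 2 through 7 fitting together with POSIX threads. The chunk boundaries here are just the naive size / n split, which is exactly what concern #1 below is about; the names (NUM_THREADS, chunk_arg_t, search_worker) are only placeholders.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 6   /* example thread count */

/* Per-thread arguments: the search term plus that thread's slice of the buffer. */
typedef struct {
    const char *term;    /* search term (shared, read-only) */
    const char *chunk;   /* start of this thread's slice    */
    size_t      len;     /* slice length in bytes           */
} chunk_arg_t;

static void *search_worker(void *p)
{
    chunk_arg_t *arg = p;
    /* ... scan arg->chunk for arg->term and print matches with line numbers ... */
    (void)arg;
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <search-term>\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* steps 1-3: open the file and read it into one heap buffer */
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return EXIT_FAILURE; }
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);
    char *data = malloc((size_t)size);
    if (!data || fread(data, 1, (size_t)size, fp) != (size_t)size) {
        fprintf(stderr, "read failed\n");
        return EXIT_FAILURE;
    }
    fclose(fp);

    /* steps 4-6: hand each thread its slice (naive size/n split for now) */
    pthread_t   tid[NUM_THREADS];
    chunk_arg_t args[NUM_THREADS];
    size_t base = (size_t)size / NUM_THREADS;

    for (int i = 0; i < NUM_THREADS; i++) {
        args[i].term  = argv[2];
        args[i].chunk = data + base * i;
        /* last thread absorbs the remainder instead of padding the file */
        args[i].len   = (i == NUM_THREADS - 1) ? (size_t)size - base * i : base;
        if (pthread_create(&tid[i], NULL, search_worker, &args[i]) != 0) {
            perror("pthread_create");
            return EXIT_FAILURE;
        }
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);

    free(data);
    return EXIT_SUCCESS;   /* step 7 */
}
```

This would be compiled with something like gcc -pthread; each args[i] stays alive until the matching pthread_join, so passing stack addresses to the threads should be safe here.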
So in summary, I have 2 areas of concern:

1.) Splitting up the workload properly so that the search gets 100% coverage. For example, let's say there are 4184737 bytes in the file to be searched and there are 6 threads. 4184737 / 6 does not divide evenly, so the split seems like it will be misaligned and/or we may miss a search term at a boundary. To accommodate this, I was thinking of rounding the byte count up with 0x00 padding to the nearest number divisible by 6, for example. (See the first sketch after this list.)

2.) The size of the chunk that each thread should run one search call on. I probably will not use strstr, but as an example: do we call strstr on a string whose strlen is 80, or on a 4000-byte chunk? My inclination is to operate only on lines (splitting on \n, basically) so that the line numbers can easily be tracked, but this may be slower. (See the second sketch below.)
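
For concern #1, the alternative to 0x00 padding I've been sketching: 4184737 / 6 is 697456 with a remainder of 1, so five chunks of 697456 bytes plus a final chunk of 697457 bytes already cover every byte with no padding needed. If each cut point is then pushed forward to the next '\n', no line straddles two chunks, so nothing gets missed at a boundary and nothing is searched twice. Something like this (untested), assuming the whole file is already in one buffer:

```c
#include <stddef.h>
#include <string.h>

typedef struct { size_t start, end; } range_t;   /* half-open range [start, end) */

/* Compute per-thread byte ranges: naive size/n cut points, each advanced
 * to just past the next newline so whole lines stay in one chunk. */
static void chunk_bounds(const char *data, size_t size, int n, range_t *out)
{
    size_t prev_end = 0;
    for (int i = 0; i < n; i++) {
        out[i].start = prev_end;
        if (i == n - 1) {
            out[i].end = size;                    /* last chunk takes the remainder */
        } else {
            size_t cut = (size / (size_t)n) * (size_t)(i + 1);
            if (cut < prev_end)
                cut = prev_end;                   /* guard against tiny files */
            const char *nl = memchr(data + cut, '\n', size - cut);
            out[i].end = nl ? (size_t)(nl - data) + 1 : size;
        }
        prev_end = out[i].end;                    /* next chunk starts where this one ends */
    }
}
```

Those ranges would then fill in the chunk/len fields of the thread arguments in the skeleton above.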
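
For concern #2, the line-by-line version I'm leaning toward would walk each chunk with memchr and run a bounded substring search on each line, so the line number is always known. This assumes each thread knows the line number its chunk starts at (I'd have to count newlines before the chunk, or have the splitter record it). I avoided strstr here only because the lines inside the buffer aren't NUL-terminated; rough sketch:

```c
#include <stdio.h>
#include <string.h>

/* naive bounded substring search over [hay, hay + hay_len) */
static const char *find_in_range(const char *hay, size_t hay_len, const char *needle)
{
    size_t nlen = strlen(needle);
    if (nlen == 0 || nlen > hay_len)
        return NULL;
    for (size_t i = 0; i + nlen <= hay_len; i++)
        if (memcmp(hay + i, needle, nlen) == 0)
            return hay + i;
    return NULL;
}

/* Walk the chunk line by line so the line number is always known. */
static void search_lines(const char *chunk, size_t len, const char *term, size_t first_line)
{
    size_t line_no = first_line;
    const char *p = chunk, *end = chunk + len;

    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        const char *line_end = nl ? nl : end;

        if (find_in_range(p, (size_t)(line_end - p), term))
            printf("%zu: %.*s\n", line_no, (int)(line_end - p), p);

        p = line_end + 1;   /* skip past the newline */
        line_no++;
    }
}
```

One thing I'm aware of: with several threads printing, the order of output lines across threads is arbitrary (as far as I know, on POSIX each printf call locks stdout, so individual lines shouldn't be intermixed, but their interleaving isn't deterministic). The other option from step 6, searching the whole chunk at once and then counting '\n' bytes before each hit, would avoid the per-line overhead but needs that extra counting pass to report line numbers.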

Do any of you folks with more concurrent programming experience have any recommendations here? The idea is that searching will be much faster because instead of having one thread search the large file top to bottom, we will have 6 threads (for example) split the file up and search it at the same time.