Thread: Need help to write a program in c

  1. #1
    Registered User
    Join Date
    Nov 2011
    Posts
    2

    Need help to write a program in c

    Hi,

    I need to write a C program to remove duplicates from a file.
    How do I go about writing this program? Please help...
    I am a newbie in C...

    I have thought of 2 ways to proceed with the program... don't know if they're correct or even doable... just a thought...

    1. first read a line from the file, then read the second line from the file and compare them. If they are the same, delete the second line. Now read the third line and compare it with the first line, and so on...

    2. or read the file line by line and store it in an array and then make comparisons within the array...

    please help...
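    For what it's worth, approach 1 is essentially what the Unix `uniq` tool does, and it only catches *consecutive* duplicates. Here's a minimal sketch of that idea working on an in-memory array of lines (the function name is mine, just for illustration; with a file you'd do the same thing with fgets() and a "previous line" buffer):

    ```c
    #include <string.h>

    /* Keep only the first line of each run of identical consecutive
     * lines.  Compacts the array in place and returns the new count.
     * Note: non-adjacent duplicates survive -- sort first if you want
     * them gone too. */
    int drop_consecutive_dups(char *lines[], int n)
    {
        if (n == 0) return 0;
        int kept = 1;                          /* lines[0] always survives */
        for (int i = 1; i < n; i++)
            if (strcmp(lines[i], lines[kept - 1]) != 0)
                lines[kept++] = lines[i];
        return kept;
    }
    ```

    Notice that with input like cat, cat, dog, dog, cat this keeps cat, dog, cat — the last "cat" is not adjacent to the first, so it stays.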

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by cooljack08
    1. first read a line from a file, then read the second line from the file and compare them. If they are same then delete the second line. Now read the third line and compare it with the first line and so on...
    Incomplete (what if they are different?), but it could work for removing consecutive duplicate lines.

    Quote Originally Posted by cooljack08
    2. or read the file line by line and store it in an array and then make comparisons within the array...
    Yes, though you need to be clear on what comparisons you intend to make.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Nov 2011
    Posts
    2
    Thanks for your quick reply...

    I have a wordlist.txt file...

    it has around 700k words... many words are repeated...

    so my aim is to remove the duplicate words so that I am left with only unique words...

  4. #4
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    So, what's your plan now?

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    If I close my eyes, I swear I can hear the "gears" turning and turning!

    Go Jack, go!

  6. #6
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by cooljack08 View Post
    thanks 4 your quick reply...

    i have a worldlist.txt file...

    it has around 700k words... many words are repeated...

    so my aim is to remove the duplicate words so that i am left with only unique words...
    1) load the entire file into memory with a single fread() call.
    2) run across the resulting super-string with tolower() so that case differences are eliminated.
    3) run across the file with strtok() to remove newlines, spaces, tabs etc building an array of pointers to each word.
    4) sort the file into alphabetical order, which will put duplicate words together.
    5) write out the file word by word, checking for adjacent duplicates as you go.
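    The five steps could be sketched something like this (step 1 is just one big fread() into a malloc'd buffer, so only steps 2-5 are shown; MAX_WORDS, the function names, and the delimiter set are my assumptions):

    ```c
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_WORDS 1000000      /* assumed upper bound on word count */

    static int cmp_words(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    /* Steps 2-5 on an in-memory, NUL-terminated buffer: lowercase it,
     * tokenise it into a pointer array, sort the pointers, then walk
     * the sorted array writing each word once (adjacent dups skipped).
     * Returns the number of unique words written; buf is modified. */
    size_t write_unique(char *buf, FILE *out)
    {
        static char *words[MAX_WORDS];
        size_t n = 0;

        for (char *p = buf; *p; p++)                 /* step 2 */
            *p = (char)tolower((unsigned char)*p);

        for (char *w = strtok(buf, " \t\r\n");       /* step 3 */
             w != NULL && n < MAX_WORDS;
             w = strtok(NULL, " \t\r\n"))
            words[n++] = w;

        qsort(words, n, sizeof *words, cmp_words);   /* step 4 */

        size_t unique = 0;                           /* step 5 */
        for (size_t i = 0; i < n; i++)
            if (i == 0 || strcmp(words[i], words[i - 1]) != 0) {
                fprintf(out, "%s\n", words[i]);
                unique++;
            }
        return unique;
    }
    ```

    The array holds pointers into the one big buffer, so no word is ever copied — only pointers get moved around during the sort.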

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,666
    Or learn to do the job properly by calling fgets() in a loop.

    Seriously CT, what are people gonna do when they start processing files which won't fit in memory for convenience?
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  8. #8
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Salem View Post
    Or learn to do the job properly by calling fgets() in a loop.
    Sure... then how do you locate dupes? By scanning the whole file for every word you load? Boring!

    Seriously CT, what are people gonna do when they start processing files which won't fit in memory for convenience?
    Buy more memory.

    700,000 words with an average word length of 5 characters is ~3.5 megabytes... trivial in today's systems.

    The only time that would be a problem is if someone is working on that Fred Flintstone TurboC 16-bit compiler that just won't wither up and blow away! Today's 32 or 64 bit systems can handle it with ease.

    As I mentioned to Adak in another thread, I have live software out there loading 100mb+ index files as routine without any problems at all.

    EDIT: Just ran a quick test... loaded a 1.4gb movie in one pass on a 4gb system... still got 2gb to play with. Seriously, it's "old think" to figure you have to do these things a word at a time or even a line at a time, when you can just get the whole thing loaded up and take advantage of the (not insubstantial) speed increase by working in memory.
    Last edited by CommonTater; 11-17-2011 at 12:11 PM.

  9. #9
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    The way I did this in very small memory, was to use tiers:

    1) load up 600 objects at a time (I used structs, but the same idea for strings).

    2) Quicksort each of these first-tier arrays.

    3) write them to file with a number (or letter) appended onto the filename: filename1, filename2, etc.

    4) after all the files had been sorted out, I had dozens of these numbered files, which was perfect for a merge.

    5) every level of the merge phase, reduced the number of files remaining to be merged, by half.

    It did a fine job. Quicker than you'd expect. The smaller files would be merged very quickly because they fit into the HD buffer. Using an SSD drive makes this a very acceptably fast sort.

    When you have massive - really massive - amounts of data that need to be sorted and won't fit into main memory, it can still be sorted using this tiered approach. It's not as simple or as fast as sorting in main memory, however.
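    The merge step at the heart of this (step 5) could be sketched as below — merging two files of already-sorted lines into one, which each pass applies to successive file pairs until one file remains (the 255-character line cap is just to keep the sketch short):

    ```c
    #include <stdio.h>
    #include <string.h>

    /* Merge two files of sorted lines into one sorted output file --
     * the core step of the tiered (external) merge sort.  Always emits
     * whichever of the two current lines sorts first, then refills it. */
    void merge_sorted(FILE *a, FILE *b, FILE *out)
    {
        char la[256], lb[256];
        char *pa = fgets(la, sizeof la, a);
        char *pb = fgets(lb, sizeof lb, b);

        while (pa && pb) {
            if (strcmp(la, lb) <= 0) {
                fputs(la, out);
                pa = fgets(la, sizeof la, a);
            } else {
                fputs(lb, out);
                pb = fgets(lb, sizeof lb, b);
            }
        }
        /* drain whichever file still has lines left */
        while (pa) { fputs(la, out); pa = fgets(la, sizeof la, a); }
        while (pb) { fputs(lb, out); pb = fgets(lb, sizeof lb, b); }
    }
    ```

    Since each pass only ever holds two lines in memory, this works no matter how large the files are.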

  10. #10
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,666
    If the file is sorted, then removing dupes is as easy as falling off a log.

    If the file isn't sorted, then you could do other things, such as building a binary tree of unique lines. If there are many dupes, the memory footprint will be so much smaller than dumping the whole file into memory, AND the binary tree will make it extremely efficient at finding dupes (are you still bored?).

    Remember, not everyone has the luxury of GB of memory all the time. Long before you've filled up available memory, your OS might begin swapping out parts of the file you've just read in, all because you're too lazy to think up a better data structure than "slab-o-memory" to solve the problem.
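    The binary tree idea in outline (node layout and function name are mine, just for illustration) — each fgets() line goes through this, and duplicates simply never get inserted:

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* One node per *unique* line.  Duplicates are detected in O(log n)
     * comparisons on average, and memory grows with the number of
     * unique lines, not the file size. */
    struct node {
        char *line;
        struct node *left, *right;
    };

    /* Insert a copy of line into the tree unless it's already there.
     * Returns 1 if the line was new, 0 if it was a duplicate. */
    int insert_unique(struct node **root, const char *line)
    {
        if (*root == NULL) {
            struct node *n = malloc(sizeof *n);
            if (!n) return 0;
            n->line = strdup(line);   /* strdup is POSIX; malloc+strcpy on strict C89 */
            n->left = n->right = NULL;
            *root = n;
            return 1;
        }
        int cmp = strcmp(line, (*root)->line);
        if (cmp == 0)
            return 0;                 /* duplicate -- not inserted */
        return insert_unique(cmp < 0 ? &(*root)->left
                                     : &(*root)->right, line);
    }
    ```

    As a bonus, an in-order walk of the finished tree prints the unique lines in sorted order for free.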

  11. #11
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Adak View Post
    The way I did this in very small memory, was to use tiers:

    1) load up 600 objects at a time (I used structs, but the same idea for strings).

    2) Quicksort to sort them each first tier level array.

    3) write them to file with a number (or letter) appended onto the filename: filename1, filename2, etc.

    4) after all the files had been sorted out, I had dozens of these numbered files, which was perfect for a merge.

    5) every level of the merge phase, reduced the number of files remaining to be merged, by half.

    It did a fine job. Quicker than you'd expect. The smaller files would be merged very quickly because they fit into the HD buffer. Using an SSD drive makes this a very acceptably fast sort.

    When you have massive - really massive - amounts of data that need to be sorted and won't fit into main memory, it can still be sorted using this tiered approach. It's not as simple or as fast as sorting in main memory, however.
    Ok... I just took a 100,000 word file and concatenated two more copies onto it, to give me 300,000 triplicated words.

    I loaded the whole thing into memory, ran tolower() across it, then strtok() to build an array of pointers to the words, sorted it with a basic selection sort (swapping pointers, not text), then wrote the thing out one word per line, using a double buffer to catch consecutive dupes... Total time: just over 28 seconds on my older AMD system (X2, 2.4GHz, 4GB DDR2-800 RAM, W7 x64).

    (Probably not braggable, but certainly not disastrous either.)
    Last edited by CommonTater; 11-17-2011 at 03:31 PM.

  12. #12
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Salem View Post
    Remember, not everyone has the luxury of GB of memory all the time. Long before you've filled up available memory, your OS might begin swapping out parts of the file you've just read in, all because you're too lazy to think up a better data structure than "slab-o-memory" to solve the problem.
    Well, given that I have the swap files disabled on my 4 machines that's not going to happen.

    Not necessarily lazy... just never felt the need.

  13. #13
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by CommonTater
    sorted it with a basic selection sort (swapping pointers not text), (...) using a double buffer to catch consecutive dupes... Total time just over 28 seconds
    Tee hee, I wonder if the "basic selection sort" accounted for the bulk of that total time: "sort, then remove consecutive duplicates" only beats "linear search for duplicates" in time complexity when the sort itself does better than O(n^2).

  14. #14
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by laserlight View Post
    Tee hee, I wonder if the "basic selection sort" took a major component of the total time since the "sort then handle consecutive duplicates" method beats the "linear search for duplicates" in time complexity only when a sorting algorithm with better time complexity is used.
    I was aware of that when I wrote it... but I had a code snippet tucked away in the IDE, and rather than spend all night messing with it I just did a little scoop and poop... Yes, the lion's share of the time was spent sorting... probably 90%... A more advanced algorithm, even good old quicksort, would definitely bring the time down by a significant amount.
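    Swapping the selection sort for the library's qsort() is a two-line change; the only gotcha is the comparator, which receives pointers to the array *elements* — and since the elements are char*, it gets handed char** and must dereference once (function names are mine):

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* Comparator for an array of char* -- qsort passes pointers to
     * the elements, so dereference once before comparing. */
    static int by_word(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    /* Drop-in replacement for the O(n^2) selection sort: still swaps
     * pointers, not text, but in O(n log n) comparisons on average. */
    void sort_words(char *words[], size_t n)
    {
        qsort(words, n, sizeof *words, by_word);
    }
    ```

    Getting that comparator cast wrong (treating the void* as char* instead of char**) is probably the most common qsort bug going.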

  15. #15
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    With the nature of Cooljack's question in mind, I believe the simpler solution is definitely the way to go. <eww, minus the selection sort! It's a real stinker for data that isn't already mostly sorted.>

    "Even old quicksort?" You mean the optimized Quicksort that IS the fastest general-purpose sorter on the planet, right?
    The old Quicksorts may have been Hoare-able, but the new ones really rock!

