Thread: Need help to write a program in c

  1. #1
    Registered User
    Join Date
    Nov 2011
    Posts
    2

    Need help to write a program in c

    Hi,

    I need to write a C program to remove duplicates from a file.
    How do I go about writing this program? Please help...
    I am a newbie in C...

    I have thought of 2 ways to proceed with the program... don't know if they're correct or even doable... just a thought...

    1. first read a line from the file, then read the second line from the file and compare them. If they are the same, delete the second line. Now read the third line and compare it with the first line, and so on...

    2. or read the file line by line and store it in an array and then make comparisons within the array...

    please help...
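    For what it's worth, approach 1 is essentially what the Unix `uniq` tool does, and it only catches *consecutive* duplicates. Here's a minimal sketch of that idea working on an in-memory array of lines (the function name is mine, just for illustration; with a file you'd do the same thing with fgets() and a "previous line" buffer):

    ```c
    #include <string.h>

    /* Keep only the first line of each run of identical consecutive
     * lines.  Compacts the array in place and returns the new count.
     * Note: non-adjacent duplicates survive -- sort first if you want
     * them gone too. */
    int drop_consecutive_dups(char *lines[], int n)
    {
        if (n == 0) return 0;
        int kept = 1;                          /* lines[0] always survives */
        for (int i = 1; i < n; i++)
            if (strcmp(lines[i], lines[kept - 1]) != 0)
                lines[kept++] = lines[i];
        return kept;
    }
    ```

    Notice that with input like cat, cat, dog, dog, cat this keeps cat, dog, cat — the last "cat" is not adjacent to the first, so it stays.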

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by cooljack08
    1. first read a line from a file, then read the second line from the file and compare them. If they are same then delete the second line. Now read the third line and compare it with the first line and so on...
    Incomplete (what if they are different?), but it could work for removing consecutive duplicate lines.

    Quote Originally Posted by cooljack08
    2. or read the file line by line and store it in an array and then make comparisons within the array...
    Yes, though you need to be clear on what comparisons you intend to make.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Nov 2011
    Posts
    2
    Thanks for your quick reply...

    I have a wordlist.txt file...

    it has around 700k words... many words are repeated...

    so my aim is to remove the duplicate words so that I am left with only unique words...

  4. #4
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    So, what's your plan now?

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    If I close my eyes, I swear I can hear the "gears" turning and turning!

    Go Jack, go!

  6. #6
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by cooljack08 View Post
    thanks 4 your quick reply...

    i have a worldlist.txt file...

    it has around 700k words... many words are repeated...

    so my aim is to remove the duplicate words so that i am left with only unique words...
    1) load the entire file into memory with a single fread() call.
    2) run across the resulting super-string with tolower() so that case differences are eliminated.
    3) run across the file with strtok() to remove newlines, spaces, tabs etc building an array of pointers to each word.
    4) sort the file into alphabetical order, which will put duplicate words together.
    5) write out the file word by word, checking for adjacent duplicates as you go.
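    The five steps could be sketched something like this (step 1 is just one big fread() into a malloc'd buffer, so only steps 2-5 are shown; MAX_WORDS, the function names, and the delimiter set are my assumptions):

    ```c
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_WORDS 1000000      /* assumed upper bound on word count */

    static int cmp_words(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    /* Steps 2-5 on an in-memory, NUL-terminated buffer: lowercase it,
     * tokenise it into a pointer array, sort the pointers, then walk
     * the sorted array writing each word once (adjacent dups skipped).
     * Returns the number of unique words written; buf is modified. */
    size_t write_unique(char *buf, FILE *out)
    {
        static char *words[MAX_WORDS];
        size_t n = 0;

        for (char *p = buf; *p; p++)                 /* step 2 */
            *p = (char)tolower((unsigned char)*p);

        for (char *w = strtok(buf, " \t\r\n");       /* step 3 */
             w != NULL && n < MAX_WORDS;
             w = strtok(NULL, " \t\r\n"))
            words[n++] = w;

        qsort(words, n, sizeof *words, cmp_words);   /* step 4 */

        size_t unique = 0;                           /* step 5 */
        for (size_t i = 0; i < n; i++)
            if (i == 0 || strcmp(words[i], words[i - 1]) != 0) {
                fprintf(out, "%s\n", words[i]);
                unique++;
            }
        return unique;
    }
    ```

    The array holds pointers into the one big buffer, so no word is ever copied — only pointers get moved around during the sort.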

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,666
    Or learn to do the job properly by calling fgets() in a loop.

    Seriously CT, what are people gonna do when they start processing files which won't fit in memory for convenience?
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  8. #8
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Salem View Post
    Or learn to do the job properly by calling fgets() in a loop.
    Sure... then how do you locate dupes? By scanning the whole file for every word you load? Boring!

    Seriously CT, what are people gonna do when they start processing files which won't fit in memory for convenience?
    Buy more memory.

    700,000 words with an average word length of 5 characters is ~3.5 megabytes... trivial in today's systems.

    The only time that would be a problem is if someone is working on that Fred Flintstone TurboC 16-bit compiler that just won't wither up and blow away! Today's 32 or 64 bit systems can handle it with ease.

    As I mentioned to Adak in another thread, I have live software out there loading 100mb+ index files as routine without any problems at all.

    EDIT: Just ran a quick test... loaded a 1.4gb movie in one pass on a 4gb system... still got 2gb to play with. Seriously, it's "old think" to figure you have to do these things a word at a time or even a line at a time, when you can just get the whole thing loaded up and take advantage of the (not insubstantial) speed increase by working in memory.
    Last edited by CommonTater; 11-17-2011 at 12:11 PM.

  9. #9
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    The way I did this in very small memory, was to use tiers:

    1) load up 600 objects at a time (I used structs, but the same idea for strings).

    2) Quicksort each of these first-tier arrays.

    3) write them to file with a number (or letter) appended onto the filename: filename1, filename2, etc.

    4) after all the files had been sorted out, I had dozens of these numbered files, which was perfect for a merge.

    5) every level of the merge phase, reduced the number of files remaining to be merged, by half.

    It did a fine job. Quicker than you'd expect. The smaller files would be merged very quickly because they fit into the HD buffer. Using an SSD drive makes this a very acceptably fast sort.

    When you have massive - really massive - amounts of data that need to be sorted and won't fit into main memory, it can still be sorted using this tiered approach. It's not as simple or as fast as sorting in main memory, however.
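    The merge step at the heart of this (step 5) could be sketched as below — merging two files of already-sorted lines into one, which each pass applies to successive file pairs until one file remains (the 255-character line cap is just to keep the sketch short):

    ```c
    #include <stdio.h>
    #include <string.h>

    /* Merge two files of sorted lines into one sorted output file --
     * the core step of the tiered (external) merge sort.  Always emits
     * whichever of the two current lines sorts first, then refills it. */
    void merge_sorted(FILE *a, FILE *b, FILE *out)
    {
        char la[256], lb[256];
        char *pa = fgets(la, sizeof la, a);
        char *pb = fgets(lb, sizeof lb, b);

        while (pa && pb) {
            if (strcmp(la, lb) <= 0) {
                fputs(la, out);
                pa = fgets(la, sizeof la, a);
            } else {
                fputs(lb, out);
                pb = fgets(lb, sizeof lb, b);
            }
        }
        /* drain whichever file still has lines left */
        while (pa) { fputs(la, out); pa = fgets(la, sizeof la, a); }
        while (pb) { fputs(lb, out); pb = fgets(lb, sizeof lb, b); }
    }
    ```

    Since each pass only ever holds two lines in memory, this works no matter how large the files are.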

  10. #10
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,666
    If the file is sorted, then removing dupes is as easy as falling off a log.

    If the file isn't sorted, then you could do other things, such as building a binary tree of unique lines. If there are many dupes, the memory footprint will be so much smaller than dumping the whole file into memory, AND the binary tree will make it extremely efficient at finding dupes (are you still bored?).

    Remember, not everyone has the luxury of GB of memory all the time. Long before you've filled up available memory, your OS might begin swapping out parts of the file you've just read in, all because you're too lazy to think up a better data structure than "slab-o-memory" to solve the problem.
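    The binary tree idea in outline (node layout and function name are mine, just for illustration) — each fgets() line goes through this, and duplicates simply never get inserted:

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* One node per *unique* line.  Duplicates are detected in O(log n)
     * comparisons on average, and memory grows with the number of
     * unique lines, not the file size. */
    struct node {
        char *line;
        struct node *left, *right;
    };

    /* Insert a copy of line into the tree unless it's already there.
     * Returns 1 if the line was new, 0 if it was a duplicate. */
    int insert_unique(struct node **root, const char *line)
    {
        if (*root == NULL) {
            struct node *n = malloc(sizeof *n);
            if (!n) return 0;
            n->line = strdup(line);   /* strdup is POSIX; malloc+strcpy on strict C89 */
            n->left = n->right = NULL;
            *root = n;
            return 1;
        }
        int cmp = strcmp(line, (*root)->line);
        if (cmp == 0)
            return 0;                 /* duplicate -- not inserted */
        return insert_unique(cmp < 0 ? &(*root)->left
                                     : &(*root)->right, line);
    }
    ```

    As a bonus, an in-order walk of the finished tree prints the unique lines in sorted order for free.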

  11. #11
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Adak View Post
    The way I did this in very small memory, was to use tiers:

    1) load up 600 objects at a time (I used structs, but the same idea for strings).

    2) Quicksort to sort them each first tier level array.

    3) write them to file with a number (or letter) appended onto the filename: filename1, filename2, etc.

    4) after all the files had been sorted out, I had dozens of these numbered files, which was perfect for a merge.

    5) every level of the merge phase, reduced the number of files remaining to be merged, by half.

    It did a fine job. Quicker than you'd expect. The smaller files would be merged very quickly because they fit into the HD buffer. Using an SSD drive makes this a very acceptably fast sort.

    When you have massive - really massive - amounts of data that need to be sorted and won't fit into main memory, it can still be sorted using this tiered approach. It's not as simple or as fast as sorting in main memory, however.
    Ok... I just took a 100,000 word file and concatenated two more copies onto it, to give me 300,000 triplicated words.

    I loaded the whole thing into memory, ran tolower() across it, then strtok() to build an array of pointers to the words, sorted it with a basic selection sort (swapping pointers, not text), then wrote the thing out one word per line, using a double buffer to catch consecutive dupes... Total time: just over 28 seconds on my older AMD system (X2, 2.4GHz, 4GB DDR2-800 RAM, W7 x64).

    (Probably not braggable, but certainly not disastrous either.)
    Last edited by CommonTater; 11-17-2011 at 03:31 PM.

  12. #12
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Salem View Post
    Remember, not everyone has the luxury of GB of memory all the time. Long before you've filled up available memory, your OS might begin swapping out parts of the file you've just read in, all because you're too lazy to think up a better data structure than "slab-o-memory" to solve the problem.
    Well, given that I have the swap files disabled on my 4 machines that's not going to happen.

    Not necessarily lazy... just never felt the need.

  13. #13
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by CommonTater
    sorted it with a basic selection sort (swapping pointers not text), (...) using a double buffer to catch consecutive dupes... Total time just over 28 seconds
    Tee hee, I wonder if the "basic selection sort" accounted for the bulk of that total time: "sort, then remove consecutive duplicates" only beats "linear search for duplicates" in time complexity when the sort itself does better than O(n^2).

  14. #14
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by laserlight View Post
    Tee hee, I wonder if the "basic selection sort" took a major component of the total time since the "sort then handle consecutive duplicates" method beats the "linear search for duplicates" in time complexity only when a sorting algorithm with better time complexity is used.
    I was aware of that when I wrote it... but I had a code snippet tucked away in the IDE, and rather than spend all night messing with it I just did a little scoop and poop... Yes, the lion's share of the time was spent sorting... probably 90%... A more advanced algorithm, even good old quicksort, would definitely bring the time down by a significant amount.
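    Swapping the selection sort for the library's qsort() is a two-line change; the only gotcha is the comparator, which receives pointers to the array *elements* — and since the elements are char*, it gets handed char** and must dereference once (function names are mine):

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* Comparator for an array of char* -- qsort passes pointers to
     * the elements, so dereference once before comparing. */
    static int by_word(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    /* Drop-in replacement for the O(n^2) selection sort: still swaps
     * pointers, not text, but in O(n log n) comparisons on average. */
    void sort_words(char *words[], size_t n)
    {
        qsort(words, n, sizeof *words, by_word);
    }
    ```

    Getting that comparator cast wrong (treating the void* as char* instead of char**) is probably the most common qsort bug going.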

  15. #15
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    With the nature of Cooljack's question in mind, I believe the simpler solution is definitely the way to go. <eww, minus the selection sort! It's a real stinker for data that isn't already mostly sorted.>

    "Even old quicksort?" You mean the optimized Quicksort that IS the fastest general-purpose sorter on the planet, right?
    The old Quicksorts may have been Hoare-able, but the new ones really rock!

