Thread: need to find duplicate strings in file

  1. #1
    Registered User
    Join Date
    Aug 2010
    Posts
    1

    need to find duplicate strings in file

    Hi All,
    I need to write C code to find the duplicate strings in a file.
    The problem is that all the strings are concatenated in a single file.
    Since no delimiter is present, I first need to extract the strings based on a fixed number of characters, and then do the comparison.
    I need to write all the duplicate strings to another file.

    The file is really huge: about 10^8 strings.

    For ex:
    Input file:
    abcddefgdefgghijjikl

    Assuming each string is four characters long:

    Expected output:
    defg


    Thanks in advance.

  2. #2
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    Since when did characters become strings?

  3. #3
    Registered User
    Join Date
    Aug 2010
    Location
    England
    Posts
    90
    Are you familiar with binary tree sorts?

                  fred
                /      \
            bill        george
           /    \      /      \
        alf     cal  gary    harold

    As you pick up each character string, you "hang" it on the tree. There are plenty of libraries with the code for this.

    Each node in the tree stores the character string, plus the repeat count.
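A minimal sketch of such a node and its insert routine might look like the following. The names `struct node`, `tree_add`, and `tree_print_duplicates` are made up for illustration and do not come from any particular library. The tree here is a plain, unbalanced binary search tree, so sorted input would degrade it into a linked list.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One tree node: the string plus how many times it has been seen. */
struct node {
    char *key;
    unsigned long count;
    struct node *left, *right;
};

/* Adds key to the tree, or bumps its count if it is already there.
   Returns 1 on success, 0 if memory allocation fails. */
int tree_add(struct node **root, const char *key)
{
    while (*root != NULL) {
        int cmp = strcmp(key, (*root)->key);
        if (cmp == 0) {
            (*root)->count++;
            return 1;
        }
        root = cmp < 0 ? &(*root)->left : &(*root)->right;
    }

    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    size_t len = strlen(key) + 1;
    n->key = malloc(len);
    if (n->key == NULL) {
        free(n);
        return 0;
    }
    memcpy(n->key, key, len);
    n->count = 1;
    n->left = n->right = NULL;
    *root = n;
    return 1;
}

/* In-order walk that prints every string seen more than once. */
void tree_print_duplicates(const struct node *n, FILE *out)
{
    if (n == NULL)
        return;
    tree_print_duplicates(n->left, out);
    if (n->count > 1)
        fprintf(out, "%s\n", n->key);
    tree_print_duplicates(n->right, out);
}
```

Start with `struct node *root = NULL;`, call `tree_add(&root, s)` for each extracted string, then call `tree_print_duplicates(root, out)` once at the end.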

    If you are worried about memory, then the tree can be created as a random access file, with each record being a node, and forward/back pointers stored in each record.
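As a rough sketch of the file-backed variant, each record below holds the string, its count, and the record numbers of its two children. Everything here is an assumption made for illustration: the names `struct disk_node`, `node_read`, `node_write`, and `disk_tree_add`, the fixed `KEY_LEN` of 4 taken from the example in the first post, and the use of record numbers in place of pointers. The file must be opened for both reading and writing in binary mode (for example with `"w+b"`).

```c
#include <stdio.h>
#include <string.h>

#define KEY_LEN 4        /* assumed fixed string length */
#define NO_NODE (-1L)    /* marks a missing child */

struct disk_node {
    char key[KEY_LEN + 1];
    unsigned long count;
    long left, right;    /* record numbers of the children, or NO_NODE */
};

/* Reads record number idx into *n. Returns 1 on success. */
int node_read(FILE *fp, long idx, struct disk_node *n)
{
    return fseek(fp, idx * (long)sizeof *n, SEEK_SET) == 0
        && fread(n, sizeof *n, 1, fp) == 1;
}

/* Writes *n as record number idx. Returns 1 on success. */
int node_write(FILE *fp, long idx, const struct disk_node *n)
{
    return fseek(fp, idx * (long)sizeof *n, SEEK_SET) == 0
        && fwrite(n, sizeof *n, 1, fp) == 1;
}

/* Adds key to the tree stored in fp, or bumps its count.
   *n_nodes is the number of records written so far.
   Returns 1 on success, 0 on an I/O error. */
int disk_tree_add(FILE *fp, long *n_nodes, const char *key)
{
    struct disk_node cur, fresh;
    long idx = 0;

    memset(&fresh, 0, sizeof fresh);
    strncpy(fresh.key, key, KEY_LEN);
    fresh.count = 1;
    fresh.left = fresh.right = NO_NODE;

    if (*n_nodes == 0) {                 /* empty tree: write the root */
        if (!node_write(fp, 0, &fresh))
            return 0;
        *n_nodes = 1;
        return 1;
    }

    for (;;) {
        if (!node_read(fp, idx, &cur))
            return 0;
        int cmp = strncmp(key, cur.key, KEY_LEN);
        if (cmp == 0) {                  /* seen before: update the count */
            cur.count++;
            return node_write(fp, idx, &cur);
        }
        long *child = cmp < 0 ? &cur.left : &cur.right;
        if (*child == NO_NODE) {         /* append a record, link the parent */
            *child = *n_nodes;
            if (!node_write(fp, idx, &cur))
                return 0;
            if (!node_write(fp, *n_nodes, &fresh))
                return 0;
            (*n_nodes)++;
            return 1;
        }
        idx = *child;
    }
}
```

Every lookup costs one seek and one read per tree level, so this trades speed for memory. Whether that is acceptable for a file of the size mentioned in the first post would need measuring.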

    As I said there is lots of library code for this.
    Never re-write code unless the user benefits

  4. #4
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    Interesting problem. There could be many duplicate strings. I suppose you want to find the longest one. I don't know an algorithm off hand. It's not a C problem until after you know an algorithm.

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Welcome to the forum, E.Vikaas!

  6. #6
    Registered User
    Join Date
    Aug 2010
    Location
    England
    Posts
    90
    Try this link.

    Should help get you started

    How to Create a Binary Tree in C | eHow.com

  7. #7
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    The extraction of strings according to the given length should be rather trivial. To gain confidence, you can do that first and make sure that it is working.
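One possible way to do the extraction, assuming the records really are fixed-length with nothing between them, is a small wrapper around `fread`. The function name `read_chunk` is made up for this sketch.

```c
#include <stdio.h>
#include <string.h>

/* Reads the next len-character record from fp into buf, which must
   have room for len + 1 bytes. Returns 1 on success, 0 at end of
   input or when fewer than len characters remain. */
int read_chunk(FILE *fp, char *buf, size_t len)
{
    size_t got = fread(buf, 1, len, fp);
    if (got != len)
        return 0;
    buf[len] = '\0';
    return 1;
}
```

Calling it in a loop, `while (read_chunk(fp, buf, 4)) puts(buf);`, on the sample input should list the five strings one per line, which is an easy way to check this step before moving on. Note that a trailing newline at the end of the file is treated as a short record and dropped.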

    One general idea to find duplicate strings is to map each string to a count of its occurrences. Once you have constructed this map, you then iterate over it and copy those strings whose counts are greater than 1 to the other file. johnggold's suggestion of a (balanced) binary tree can be used to implement this map, but it is not the only applicable data structure.
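As one example of a different approach (a swapped-in technique, not something proposed in this thread), the records can be sorted so that equal strings end up next to each other, and the "count" becomes the length of each run. The sketch below assumes all records fit in memory in one contiguous buffer. The names `write_duplicates`, `cmp_records`, and `g_width` are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static size_t g_width;   /* record length, read by the comparator */

static int cmp_records(const void *a, const void *b)
{
    return memcmp(a, b, g_width);
}

/* Sorts count records of width bytes each in place, then writes every
   record that occurs more than once to out, one per line, once each.
   Returns the number of distinct duplicated records. */
size_t write_duplicates(char *records, size_t count, size_t width, FILE *out)
{
    size_t found = 0;

    g_width = width;
    qsort(records, count, width, cmp_records);

    for (size_t i = 0; i < count; ) {
        size_t j = i + 1;
        /* Extend j to the end of the run of records equal to record i. */
        while (j < count &&
               memcmp(records + i * width, records + j * width, width) == 0)
            j++;
        if (j - i > 1) {
            fwrite(records + i * width, 1, width, out);
            fputc('\n', out);
            found++;
        }
        i = j;
    }
    return found;
}
```

With four-character strings, 10^8 records come to about 400 MB in a single buffer. If that does not fit in memory, the same idea still works with an external sort over chunks of the file.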
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way
