Thread: Finding unique strings in an array

  1. #1
    Registered User
    Join Date
    Jan 2008
    Location
    Oxford, England
    Posts
    27

    Finding unique strings in an array

    There's a Perl technique I often find myself using for finding unique strings in an array that involves using a hash to keep track of which strings have already been processed. It goes something like this:

    Code:
    %seen = ();
    foreach $string (@list)
    {
      next if $seen($string) == 1;
      # do some stuff with $string
      $seen($string) = 1;
    }
    What would be the nearest C equivalent to that?

    The context is that I have some code written by someone else that loads a large number of entries into a database, but needs to check that they have not already been loaded. It does this by repeatedly querying the database and this becomes very slow. Using the above Perl trick avoids this problem but a Perl script to do the same as the C code is not very efficient otherwise.
    Last edited by knirirr; 02-20-2008 at 06:11 AM. Reason: typo in code

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    You can do that in C, but you have to decide how you translate a string into an array of counts/booleans.

    One solution is to keep a sorted list (e.g. an array) of the strings themselves, so you can do a binary search in the list. If the string is already in the list, you know that's been seen. If it's not in the list, you insert it at the point where you ended up in the binary search.

    Another solution would be to hash the input string, and use a hash-table to find the string.

    The ratio between "new words" and "previously seen words" will determine which method works best - the more often you need to insert, the worse the inserting into the array would be.

    Both variants have their own benefits and drawbacks, but will both be relatively fast as long as the list isn't so huge that it doesn't fit in the available RAM in the machine. Obviously inserting hundreds, thousands or millions of words into a list can be time-consuming.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    Registered User
    Join Date
    Jan 2008
    Location
    Oxford, England
    Posts
    27
    Thanks for the suggestions.
    RAM shouldn't be a problem here as the database server has 32GB of it and the strings come from a 2GB file. There will be quite a lot of duplicates, so presumably the second method you suggest would be the quickest.

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Lots of duplicates means that you DON'T suffer badly from insertions - because it's only the first time you see a word, you will need to insert something.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 16
    Last Post: 05-29-2009, 07:25 PM
  2. arrays vs lists? And containers in general!
    By clegs in forum C++ Programming
    Replies: 22
    Last Post: 12-03-2007, 02:02 PM
  3. Build an array of strings dynamically
    By Nazgulled in forum C Programming
    Replies: 29
    Last Post: 04-07-2007, 09:35 PM
  4. Passing an Array of Strings from VB to a C DLL
    By mr_nice! in forum Windows Programming
    Replies: 9
    Last Post: 03-08-2005, 06:16 AM
  5. declaring array of strings dynamically!!!!
    By Real in forum C Programming
    Replies: 1
    Last Post: 08-29-2003, 06:59 AM