Finding unique strings in an array

**knirirr** · 02-20-2008

There's a Perl technique I often find myself using for finding unique strings in an array that involves using a hash to keep track of which strings have already been processed. It goes something like this:

Code:

&#37;seen = ();
foreach $string (@list)
{
  next if $seen($string) == 1;
  # do some stuff with $string
  $seen($string) = 1;
}

What would be the nearest C equivalent to that?

The context is that I have some code written by someone else that loads a large number of entries into a database, but needs to check that they have not already been loaded. It does this by repeatedly querying the database and this becomes very slow. Using the above Perl trick avoids this problem but a Perl script to do the same as the C code is not very efficient otherwise.

**matsp** · 02-20-2008

You can do that in C, but you have to decide how you translate a string into an array of counts/booleans.

One solution is to keep a sorted list (e.g. an array) of the strings themselves, so you can do a binary search in the list. If the string is already in the list, you know that's been seen. If it's not in the list, you insert it at the point where you ended up in the binary search.

Another solution would be to hash the input string, and use a hash-table to find the string.

The ratio between "new words" and "previously seen words" will determine which method works best - the more often you need to insert, the worse the inserting into the array would be.

Both variants have their own benefits and drawbacks, but will both be relatively fast as long as the list isn't so huge that it doesn't fit in the available RAM in the machine. Obviously inserting hundreds, thousands or millions of words into a list can be time-consuming.

--
Mats

**knirirr** · 02-20-2008

Thanks for the suggestions.
RAM shouldn't be a problem here as the database server has 32GB of it and the strings come from a 2GB file. There will be quite a lot of duplicates, so presumably the second method you suggest would be the quickest.

**matsp** · 02-20-2008

Lots of duplicates means that you DON'T suffer badly from insertions - because it's only the first time you see a word, you will need to insert something.

--
Mats

Thread: Finding unique strings in an array

Thread Tools

Search Thread

Display

Finding unique strings in an array

Similar Threads

allocation and reallocation of memory dynamically of an integer array

arrays vs lists? And containers in general!

Build an array of strings dynamically

Passing an Array of Strings from VB to a C DLL

declaring array of strings dynamically!!!!