Finding duplicates and merging

**Happy Man** · 04-18-2004

Hello All,

I am working on two algorithms and your help will eb appreciated.

1. I have an array of any length( 10 to 10000 element), and some items are duplicated. Need to write algo that removes the duplcated items and keeps only one item whose index is less.

2. Two arrau of any lentgh and some lements are common. Need to be merged by removing duplicated item.

I am working on these algo and will be posting the code soon. ANy help before that os appreciated.

Thanks,
Happy Man for (;

**Prelude** · 04-18-2004

The first solution that comes to mind:

1) Sort and copy only the first of any duplicated elements. The sort step makes this part trivial.

2) Copy the two arrays into one, sort, then perform task 1.

**Happy Man** · 04-18-2004

For me sorting is not necessary or rather it is not part of requirment. Actually, duplicates need to be removed but order of elements should be kept same.

e.g. arr[] = {4,6,3,4,2,1,6,9};
After execution
arr[] = {4,6,3,,2,1,9};

Any inputs????

**Prelude** · 04-18-2004

>duplicates need to be removed but order of elements should be kept same
Then a frequency table would be a good option:

Code:

#include <stdio.h>

int main ( void )
{
  int arr[] = {4,6,3,4,2,1,6,9};
  int save[8];
  int dup[10] = {0};
  int i, j = 0;

  for ( i = 0; i < 8; i++ ) {
    if ( dup[ arr[i] ] == 0 ) {
      save[j++] = arr[i];
      dup[ arr[i] ] = 1;
    }
  }
  for ( i = 0; i < j; i++ ) {
    printf ( "%d ", save[i] );
  }
  printf ( "\n" );

  return 0;
}

You would need to modify the above to scale well to the quantities for your problem. For example, a binary search tree would more easily handle large quantities of values in a broad range than a table, but you lose the benefit of constant time searching.

**Thantos** · 04-18-2004

Its really not possible to "remove" an object out of an array, since that spot will have to have *some* value. What you can do is to create another array to put the results into. I'll leave it up to you to improve upon this code

Code:

#include <stdio.h>

int main(void)
{
  int arr[]= {4,6,3,4,2,1,6,9};
  int dest[sizeof arr/sizeof arr[0]];
  int count, count2, size, open=0;
  size = sizeof arr/sizeof arr[0];
  printf("size: %d\n", size);

  for (count=0; count < size; count++)
    for (count2=0; count2<size; count2++)
      if ( arr[count] == arr[count2] && count2 < count)
      {
        open++;
        break;
      }
      else
      {
        dest[count-open] = arr[count];
      }
  printf("Open: %d\n", open);
  for (count=0; count<size-open; count++)
  {
    printf("%d\n", dest[count]);
  }

  return 0;
}

**Happy Man** · 04-18-2004

Thantos,

Is there any way we can eliminate use of second array ?

Thanks,
Happy Man

**Prelude** · 04-18-2004

>Is there any way we can eliminate use of second array ?
Yes, but it wouldn't be wise when your array will potentially have up to 10,000 elements. The cost of shifting elements to repair holes would be far too great if there are a lot of duplicates anywhere except the end of the array.

**linuxdude** · 04-18-2004

Prelude(or anyone knowledgeable on the subject) could you explain frequency tables. It is odd to me. And array with an array for an element(but I know that that is not what it is)?

**Prelude** · 04-18-2004

>Prelude(or anyone knowledgeable on the subject) could you explain frequency tables.
Imagine an array that is big enough to hold any possible value in a set (such as 0-9). The value itself is the index, and the contents of the frequency table would be a count of how many values matching that index were found. So a frequency table of the set {4,6,3,4,2,1,6,9} would be:

Code:

freq[0] = 0
freq[1] = 1
freq[2] = 1
freq[3] = 1
freq[4] = 2
freq[5] = 0
freq[6] = 2
freq[7] = 0
freq[8] = 0
freq[9] = 1

That's all it is at its simplest, an array telling you how many of each of the values are in the set. It gets harder when you know that the range of values will be large or sparse. Then you'll need to resort to a data structure better suited to holding unique items and marking the frequency without being too wasteful (such as a binary search tree).

**linuxdude** · 04-19-2004

Code:

    if ( dup[ arr[i] ] == 0 ) {
      save[j++] = arr[i];
      dup[ arr[i] ] = 1;

I have having trouble with the first and 3rd lines. dup[arr[i]] is that the same as dup[the value of arr[i]]?? Thanx for your patience, but this is pretty interesting

**Prelude** · 04-19-2004

>dup[arr[i]] is that the same as dup[the value of arr[i]]??
Yes. If the value at arr[i] is 5 then the code would be the same as saying dup[5].

**Happy Man** · 04-19-2004

Like algorthms for sorting, are there any standard algorithms for finding duplicates in array? Or can we use same sorting algorithms to find duplicates? Instead checking > or < check for ==.

Please let me know

**Prelude** · 04-19-2004

>are there any standard algorithms for finding duplicates in array?
The "obvious" and most common algorithm is to walk over the array in an inner loop testing for equivalence with the current item in the outer loop:

Code:

for ( i = 0; i < n; i++ ) {
  for ( j = 0; j < n; j++ ) {
    if ( a[i] == a[j] ) {
      /* Handle duplicate */
    }
  }
}

Naturally, this is dreadfully inefficient for large arrays.

**XSquared** · 04-19-2004

>Naturally, this is dreadfully inefficient for large arrays.
Do you enjoy leaving us haning? I'd really like to know what a more efficient method is for larger arrays... come on... please??? *sad puppy eyes*

**Prelude** · 04-19-2004

>I'd really like to know what a more efficient method is for larger arrays
See my previous posts then.

My first suggestion was O(n log n), my second was O(n), and my improvement of the second for larger ranges was O(log n). All of them are substantially better than the O(n^2) nested loop solution.

Thread: Finding duplicates and merging

Thread Tools

Search Thread

Display

Finding duplicates and merging