Thread: removing duplicates and complexity

  1. #1
    nik (Registered User)
    Join Date
    Nov 2010
    Posts
    44

    removing duplicates and complexity

    Hi all, I'm having trouble determining the best way to solve this problem.

    Suppose you have an array of integers and you want to remove the duplicates.

    I have thought of some solutions, but I'm not sure about them:

    1) Create a second array, and for every element of the original array, use a second loop to check whether it's already in the second array. If it's not there, I add it to the second array; otherwise I move on to the next element of the first array.

    I think the worst case is when there are no duplicates, and the complexity is O(n^2).
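
    In code, 1) would look roughly like this (just a sketch; the names are only for illustration):

    Code:
    #include <cstddef>
    #include <vector>

    // method 1: for each input element, linear-search the output built so far
    std::vector<int> dedup_naive(const std::vector<int>& in)
    {
        std::vector<int> out;
        for (std::size_t i = 0; i < in.size(); ++i)
        {
            bool seen = false;
            for (std::size_t j = 0; j < out.size(); ++j)  // the second loop
                if (out[j] == in[i]) { seen = true; break; }
            if (!seen)
                out.push_back(in[i]);  // first occurrence: keep it
        }
        return out;
    }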

    2) Use heapsort, O(n log n), to sort the array. Then I keep a pointer to the element I'm about to check, use a counter to loop over the array, and remove the values that are equal to the value the pointer is pointing to. The problem is that I remove them by bringing back all the elements of the remaining array, so is this O(n log n) + O(n^2) = O(n^2), or is it O(n log n) + O(n) = O(n log n)?
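
    The removal step in 2) is roughly this (again just a sketch):

    Code:
    #include <cstddef>

    // after sorting: whenever a[i] equals a[i+1], shift the whole tail
    // back by one slot; in the worst case this makes the removal O(n^2)
    void remove_adjacent_dupes(int a[], std::size_t& n)
    {
        for (std::size_t i = 0; i + 1 < n; )
        {
            if (a[i] == a[i + 1])
            {
                for (std::size_t j = i + 1; j + 1 < n; ++j)
                    a[j] = a[j + 1];  // bring the remaining elements back
                --n;
            }
            else
                ++i;  // value changed, move on
        }
    }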

    3) Use an AVL tree, or a simple binary search tree: insert the values of the original array into it, then do an inorder/preorder/postorder traversal and add each element to the final array. That gives O(n log n) for the insertions and O(n) for the traversal, so it's O(n log n), right?
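
    With std::set standing in for the balanced tree, 3) would be roughly (a sketch):

    Code:
    #include <set>
    #include <vector>

    // method 3: each insertion into the balanced tree is O(log n);
    // the in-order walk over the set is O(n) and comes out sorted
    std::vector<int> dedup_tree(const std::vector<int>& in)
    {
        std::set<int> s(in.begin(), in.end());        // O(n log n) total
        return std::vector<int>(s.begin(), s.end());  // in-order traversal
    }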

    4) Use a hash table. For each element of the array, I check whether it's already in the table, which is O(1) per element, hence O(n) overall: if it's there, I don't put it in the final array; otherwise I insert it into the table and put it in the final array.

    So, assuming that checking whether an element is in the hash table is O(1) per element, this gives O(n) + O(n) = O(n) complexity.
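
    Roughly, 4) in code (a sketch using std::tr1::unordered_set; the header location varies by compiler):

    Code:
    #include <cstddef>
    #include <vector>
    #include <tr1/unordered_set>  // <unordered_set> on newer compilers

    // method 4: expected O(1) lookup/insert per element, O(n) overall
    std::vector<int> dedup_hash(const std::vector<int>& in)
    {
        std::tr1::unordered_set<int> seen;
        std::vector<int> out;
        for (std::size_t i = 0; i < in.size(); ++i)
            if (seen.insert(in[i]).second)  // true only on first occurrence
                out.push_back(in[i]);
        return out;
    }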

    I'm not sure whether the complexities I've calculated are correct. Can you please help me?

    Which is the best method? Thanks in advance.

  2. #2
    Salem (and the hat of int overfl)
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Does the output order matter?

    Can it be sorted, or does it have to be
    7 3 4 3 8 9
    becomes
    7 3 4 8 9
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    nik (Registered User)
    Join Date
    Nov 2010
    Posts
    44
    Quote Originally Posted by Salem
    Does the output order matter?

    Can it be sorted, or does it have to be
    7 3 4 3 8 9
    becomes
    7 3 4 8 9
    No, the output order doesn't matter.

  4. #4
    Salem (and the hat of int overfl)
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    A sort followed by a linear scan would be best IMO.

    All the duplicates will be adjacent, so you just output the first and then read ahead until the value changes.
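
    Something like this, as a rough sketch:

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // sort, then output the first of each run and read ahead
    // until the value changes
    std::vector<int> dedup_sorted(std::vector<int> a)
    {
        std::sort(a.begin(), a.end());           // O(n log n)
        std::vector<int> out;
        for (std::size_t i = 0; i < a.size(); )  // O(n) linear scan
        {
            out.push_back(a[i]);                 // first of this run
            std::size_t j = i + 1;
            while (j < a.size() && a[j] == a[i])
                ++j;                             // skip the duplicates
            i = j;
        }
        return out;
    }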
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    nik (Registered User)
    Join Date
    Nov 2010
    Posts
    44
    Quote Originally Posted by Salem
    A sort followed by a linear scan would be best IMO.

    All the duplicates will be adjacent, so you just output the first and then read ahead until the value changes.
    So if I want to keep the resulting values, it's better not to store them in the same array, because that would be O(n^2), right?

    If I use a second array, I think it will be O(n log n).

    Can you please tell me whether my complexities for each algorithm (1-4) are correct? I'm probably going to use what you suggested.

  6. #6
    laserlight (C++ Witch)
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by nik
    So if I want to keep the resulting values, it's better not to store them in the same array, because that would be O(n^2), right?
    No, the complexity would still be dominated by the sort (hence O(n log n)) if you do the removal correctly, though it certainly is easier to implement by hand using a second array.

    Quote Originally Posted by nik
    Can you please tell me whether my complexities for each algorithm (1-4) are correct?
    It looks like it, assuming that #2 removes elements by shifting the entire remaining portion of the array. Relating them to standard containers, #3 would be roughly equivalent to using a std::set, and #4 would be roughly equivalent to using a std::tr1::unordered_set.

    Quote Originally Posted by nik
    I'm probably going to use what you suggested.
    You could then use std::sort with std::unique or std::unique_copy.
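
    For example, a rough sketch of the std::unique_copy route into a second container:

    Code:
    #include <algorithm>
    #include <iterator>
    #include <vector>

    std::vector<int> dedup_copy(std::vector<int> src)
    {
        std::sort(src.begin(), src.end());  // duplicates become adjacent
        std::vector<int> result;
        std::unique_copy(src.begin(), src.end(),
                         std::back_inserter(result));  // one per run of equals
        return result;
    }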
    Last edited by laserlight; 01-27-2011 at 11:54 AM.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  7. #7
    nik (Registered User)
    Join Date
    Nov 2010
    Posts
    44
    Thank you very much, friends. You've helped me a lot.

  8. #8
    CornedBee (Cat without Hat)
    Join Date
    Apr 2003
    Posts
    8,895
    By the way, the standard library provides the algorithms to do this.

    Code:
    // sort so that duplicate values become adjacent, then apply the erase-unique idiom
    std::vector<int> with_dupes = data();
    std::sort(with_dupes.begin(), with_dupes.end());
    // std::unique moves the unique elements to the front and returns an iterator
    // past them; erase() then removes the leftover tail
    with_dupes.erase(std::unique(with_dupes.begin(), with_dupes.end()), with_dupes.end());
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  9. #9
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    Doesn't std::set do this for you?

  10. #10
    nik (Registered User)
    Join Date
    Nov 2010
    Posts
    44
    Yes, but I wanted to compare different kinds of approaches to this problem and figure out which one has the best time complexity.

  11. #11
    CornedBee (Cat without Hat)
    Join Date
    Apr 2003
    Posts
    8,895
    If you're in a bounded integer space (i.e., you know that the elements of your vector are below a reasonably small N), the fastest way of removing dupes is to create a bitset of size N and use it to mark numbers that have already occurred.
    Reasonable N is of course bounded by the amount of memory you're willing to spare. A megabyte allows you to have numbers up to 2^23.
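
    A rough sketch, using std::vector<bool> as the bitset and assuming the values are non-negative and below N:

    Code:
    #include <cstddef>
    #include <vector>

    // bounded integer space: one bit per possible value
    std::vector<int> dedup_bitset(const std::vector<int>& in, std::size_t N)
    {
        std::vector<bool> seen(N, false);  // bit-packed: N bits, not N bytes
        std::vector<int> out;
        for (std::size_t i = 0; i < in.size(); ++i)
            if (!seen[in[i]])              // not marked yet
            {
                seen[in[i]] = true;        // mark as already occurred
                out.push_back(in[i]);
            }
        return out;
    }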
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law
