# Thread: removing duplicates and complexity

1. ## removing duplicates and complexity

hi all, I'm having trouble determining the best way to solve this problem

suppose that you have an array of integers and you want to remove duplicates

I've thought of some solutions but I'm not sure about them

1) I create a second array, and for every element in the original array I check (using a second loop) whether it's already in the second array. If it's not there, I add it to the second array; otherwise I move on to the next element of the first array.

I think the worst case is when there are no duplicates, and the complexity is O(n^2)
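A rough sketch of what I mean for method 1 (the function name is just illustrative):

```
#include <vector>

// Method 1: for each element, linearly scan the output array to see if it
// was already copied. In the worst case (no duplicates) the inner scan runs
// over the whole output every time, so this is O(n^2).
std::vector<int> dedup_quadratic(const std::vector<int>& in) {
    std::vector<int> out;
    for (int x : in) {
        bool seen = false;
        for (int y : out) {
            if (y == x) { seen = true; break; }
        }
        if (!seen) out.push_back(x);
    }
    return out;
}
```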

2) use heapsort, O(n log n), to sort the array, then use a pointer to the element I'm about to check and a counter to loop over the array, removing the values that are equal to the value the pointer points to. The problem is that I remove them by shifting all the remaining elements back. Does this give O(n log n) + O(n^2) = O(n^2), or is it O(n log n) + O(n) = O(n log n)?

3) use an AVL tree (or a simple binary search tree): add the values of the original array to it, then use an inorder/preorder/postorder traversal and add each element to the final array. This gives O(n log n) for the insertions and O(n) for the traversal, so it's O(n log n), right?
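For example, a sketch of method 3 using std::set (a balanced binary search tree internally, typically red-black rather than AVL, but the same O(log n) insert bound):

```
#include <set>
#include <vector>

// Method 3: n insertions at O(log n) each; duplicates are dropped by the
// tree on insert. Copying out via the set's iterators is an in-order walk,
// O(n), so the total is O(n log n) and the output comes out sorted.
std::vector<int> dedup_tree(const std::vector<int>& in) {
    std::set<int> s(in.begin(), in.end());
    return std::vector<int>(s.begin(), s.end());
}
```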

4) use a hash table: for each element, check whether it's already in the table; if it is, I don't put it in the final array, otherwise I insert it into the table and put it in the final array

so assuming that checking whether an element is in the hash table is O(1) per element, this gives O(n) + O(n) = O(n) complexity
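A sketch of method 4, assuming C++11's std::unordered_set (std::tr1::unordered_set on older compilers):

```
#include <unordered_set>
#include <vector>

// Method 4: one pass, checking and inserting in O(1) average per element,
// so O(n) expected overall. Unlike the sorting-based approaches, this also
// preserves the original order of first occurrence.
std::vector<int> dedup_hash(const std::vector<int>& in) {
    std::unordered_set<int> seen;
    std::vector<int> out;
    for (int x : in)
        if (seen.insert(x).second)   // .second is true only if x was newly inserted
            out.push_back(x);
    return out;
}
```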

which is the best method? thanks in advance

2. Does the output order matter?

Can it be sorted, or does it have to be
7 3 4 3 8 9
becomes
7 3 4 8 9

3. Originally Posted by Salem
Does the output order matter?

Can it be sorted, or does it have to be
7 3 4 3 8 9
becomes
7 3 4 8 9
no, the output order doesn't matter

4. A sort followed by a linear scan would be best IMO.

All the duplicates will be adjacent, so you just output the first and then read ahead until the value changes.
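A minimal sketch of that idea (collecting into a second array for simplicity):

```
#include <algorithm>
#include <vector>

// Sort, then one linear scan: after sorting, equal values are adjacent,
// so we copy a value only when it differs from the last one copied.
// The O(n log n) sort dominates the O(n) scan.
std::vector<int> dedup_sort_scan(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    std::vector<int> out;
    for (int x : v)
        if (out.empty() || out.back() != x)
            out.push_back(x);
    return out;
}
```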

5. Originally Posted by Salem
A sort followed by a linear scan would be best IMO.

All the duplicates will be adjacent, so you just output the first and then read ahead until the value changes.
so if I want to have the values stored, it's better not to store them in the same array, because that would be O(n^2), right?

if I use a second array, I think it will be O(n log n)..

can you please tell me if my complexities for each algorithm (1-4) are correct? I'm probably going to use what you suggested

6. Originally Posted by nik
so if I want to have the values stored, it's better not to store them in the same array because it will be O(n^2) right?
No, the complexity would still be dominated by the sort (hence O(n log n)) if you do the removal correctly, though it certainly is easier to implement by hand using a second array.
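For example, a sketch of removal done correctly in the same sorted array, in a single O(n) pass (essentially what std::unique does):

```
#include <cstddef>
#include <vector>

// In-place removal on a sorted array: keep a write index and copy each new
// value forward once, instead of shifting the whole remaining tail for every
// duplicate. One pass, O(n); with the O(n log n) sort the total stays O(n log n).
std::size_t dedup_in_place(std::vector<int>& v) {
    if (v.empty()) return 0;
    std::size_t write = 1;
    for (std::size_t read = 1; read < v.size(); ++read)
        if (v[read] != v[write - 1])
            v[write++] = v[read];
    v.resize(write);   // drop the leftover tail
    return write;      // number of unique elements
}
```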

Originally Posted by nik
can you please tell me if my complexities for each algorithm(1-4) are correct?
It looks like it, assuming that #2 removes elements by shifting the entire remaining portion of the array. Relating them to standard containers, #3 would be roughly equivalent to using a std::set and #4 would be roughly equivalent to using a std::tr1::unordered_set.

Originally Posted by nik
I'm probably going to use what you suggested
You could then use std::sort with std::unique or std::unique_copy.

thank you very much, friends, you've helped me a lot

8. By the way, the standard library provides the algorithms to do this.

Code:
```
#include <algorithm>  // std::sort, std::unique
#include <vector>

std::vector<int> with_dupes = data();
std::sort(with_dupes.begin(), with_dupes.end());
// std::unique shifts the duplicates to the end and returns an iterator to
// them; erase then trims that tail off.
with_dupes.erase(std::unique(with_dupes.begin(), with_dupes.end()), with_dupes.end());
```

9. doesn't std::set do this for you?

yes, but I wanted to compare different kinds of approaches to this problem and figure out which one has the best time complexity

11. If you're in bounded integer space (i.e. you know that the elements of your vector are below a reasonably small N) the fastest way of removing dupes is to create a bitset of size N and use it to mark numbers that have already occurred.
A reasonable N is of course bounded by the amount of memory you're willing to spare; a megabyte allows you to have numbers up to 2^23 (since it holds 2^23 bits).
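A sketch of that idea, using std::vector&lt;bool&gt; as a compact bit array (the bound N is something you'd choose to fit your data):

```
#include <vector>

// Bounded-integer dedup: mark seen values in a bitset of size N.
// One pass, O(n) time plus N bits of extra memory; every input value
// must lie in [0, N).
std::vector<int> dedup_bitset(const std::vector<int>& in, int N) {
    std::vector<bool> seen(N, false);   // one bit per possible value
    std::vector<int> out;
    for (int x : in)
        if (!seen[x]) {
            seen[x] = true;
            out.push_back(x);
        }
    return out;
}
```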