Thread: removing duplicates and complexity

  1. #1
    nik (Registered User)
    Join Date
    Nov 2010
    Posts
    44

    removing duplicates and complexity

    Hi all, I'm having trouble determining the best way to solve this problem.

    Suppose you have an array of integers and you want to remove the duplicates.

    I have thought of some solutions, but I'm not sure about them:

    1) Create a second array, and for every element of the original array, use a second loop to check whether it's already in the second array. If it's not there, I add it to the second array; otherwise I move on to the next element of the first array.

    I think the worst case is when there are no duplicates, and the complexity is O(n^2).
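
    In code, 1) would look roughly like this (just a sketch; the names are only for illustration):

    Code:
    #include <cstddef>
    #include <vector>

    // method 1: for each input element, linear-search the output built so far
    std::vector<int> dedup_naive(const std::vector<int>& in)
    {
        std::vector<int> out;
        for (std::size_t i = 0; i < in.size(); ++i)
        {
            bool seen = false;
            for (std::size_t j = 0; j < out.size(); ++j)  // the second loop
                if (out[j] == in[i]) { seen = true; break; }
            if (!seen)
                out.push_back(in[i]);  // first occurrence: keep it
        }
        return out;
    }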

    2) Use heapsort, O(n log n), to sort the array. Then I keep a pointer to the element I'm about to check, use a counter to loop over the array, and remove the values that are equal to the value the pointer is pointing to. The problem is that I remove them by bringing back all the elements of the remaining array, so is this O(n log n) + O(n^2) = O(n^2), or is it O(n log n) + O(n) = O(n log n)?
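
    The removal step in 2) is roughly this (again just a sketch):

    Code:
    #include <cstddef>

    // after sorting: whenever a[i] equals a[i+1], shift the whole tail
    // back by one slot; in the worst case this makes the removal O(n^2)
    void remove_adjacent_dupes(int a[], std::size_t& n)
    {
        for (std::size_t i = 0; i + 1 < n; )
        {
            if (a[i] == a[i + 1])
            {
                for (std::size_t j = i + 1; j + 1 < n; ++j)
                    a[j] = a[j + 1];  // bring the remaining elements back
                --n;
            }
            else
                ++i;  // value changed, move on
        }
    }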

    3) Use an AVL tree, or a simple binary search tree: insert the values of the original array into it, then do an inorder/preorder/postorder traversal and add each element to the final array. That gives O(n log n) for the insertions and O(n) for the traversal, so it's O(n log n), right?
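
    With std::set standing in for the balanced tree, 3) would be roughly (a sketch):

    Code:
    #include <set>
    #include <vector>

    // method 3: each insertion into the balanced tree is O(log n);
    // the in-order walk over the set is O(n) and comes out sorted
    std::vector<int> dedup_tree(const std::vector<int>& in)
    {
        std::set<int> s(in.begin(), in.end());        // O(n log n) total
        return std::vector<int>(s.begin(), s.end());  // in-order traversal
    }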

    4) Use a hash table. For each element of the array, I check whether it's already in the table, which is O(1) per element, hence O(n) overall: if it's there, I don't put it in the final array; otherwise I insert it into the table and put it in the final array.

    So, assuming that checking whether an element is in the hash table is O(1) per element, this gives O(n) + O(n) = O(n) complexity.
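
    Roughly, 4) in code (a sketch using std::tr1::unordered_set; the header location varies by compiler):

    Code:
    #include <cstddef>
    #include <vector>
    #include <tr1/unordered_set>  // <unordered_set> on newer compilers

    // method 4: expected O(1) lookup/insert per element, O(n) overall
    std::vector<int> dedup_hash(const std::vector<int>& in)
    {
        std::tr1::unordered_set<int> seen;
        std::vector<int> out;
        for (std::size_t i = 0; i < in.size(); ++i)
            if (seen.insert(in[i]).second)  // true only on first occurrence
                out.push_back(in[i]);
        return out;
    }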

    I'm not sure whether the complexities I've calculated are correct. Can you please help me?

    Which is the best method? Thanks in advance.

  2. #2
    Salem (and the hat of int overfl)
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Does the output order matter?

    Can it be sorted, or does it have to be
    7 3 4 3 8 9
    becomes
    7 3 4 8 9
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    nik (Registered User)
    Join Date
    Nov 2010
    Posts
    44
    Quote Originally Posted by Salem
    Does the output order matter?

    Can it be sorted, or does it have to be
    7 3 4 3 8 9
    becomes
    7 3 4 8 9
    No, the output order doesn't matter.

  4. #4
    Salem (and the hat of int overfl)
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    A sort followed by a linear scan would be best IMO.

    All the duplicates will be adjacent, so you just output the first and then read ahead until the value changes.
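
    Something like this, as a rough sketch:

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // sort, then output the first of each run and read ahead
    // until the value changes
    std::vector<int> dedup_sorted(std::vector<int> a)
    {
        std::sort(a.begin(), a.end());           // O(n log n)
        std::vector<int> out;
        for (std::size_t i = 0; i < a.size(); )  // O(n) linear scan
        {
            out.push_back(a[i]);                 // first of this run
            std::size_t j = i + 1;
            while (j < a.size() && a[j] == a[i])
                ++j;                             // skip the duplicates
            i = j;
        }
        return out;
    }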
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    nik (Registered User)
    Join Date
    Nov 2010
    Posts
    44
    Quote Originally Posted by Salem
    A sort followed by a linear scan would be best IMO.

    All the duplicates will be adjacent, so you just output the first and then read ahead until the value changes.
    So if I want to keep the resulting values, it's better not to store them in the same array, because that would be O(n^2), right?

    If I use a second array, I think it will be O(n log n).

    Can you please tell me whether my complexities for each algorithm (1-4) are correct? I'm probably going to use what you suggested.

  6. #6
    laserlight (C++ Witch)
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by nik
    So if I want to keep the resulting values, it's better not to store them in the same array, because that would be O(n^2), right?
    No, the complexity would still be dominated by the sort (hence O(n log n)) if you do the removal correctly, though it certainly is easier to implement by hand using a second array.

    Quote Originally Posted by nik
    Can you please tell me whether my complexities for each algorithm (1-4) are correct?
    It looks like it, assuming that #2 removes elements by shifting the entire remaining portion of the array. Relating them to standard containers, #3 would be roughly equivalent to using a std::set, and #4 would be roughly equivalent to using a std::tr1::unordered_set.

    Quote Originally Posted by nik
    I'm probably going to use what you suggested.
    You could then use std::sort with std::unique or std::unique_copy.
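
    For example, a rough sketch of the std::unique_copy route into a second container:

    Code:
    #include <algorithm>
    #include <iterator>
    #include <vector>

    std::vector<int> dedup_copy(std::vector<int> src)
    {
        std::sort(src.begin(), src.end());  // duplicates become adjacent
        std::vector<int> result;
        std::unique_copy(src.begin(), src.end(),
                         std::back_inserter(result));  // one per run of equals
        return result;
    }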
    Last edited by laserlight; 01-27-2011 at 11:54 AM.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  7. #7
    nik (Registered User)
    Join Date
    Nov 2010
    Posts
    44
    Thank you very much, friends. You've helped me a lot.

  8. #8
    CornedBee (Cat without Hat)
    Join Date
    Apr 2003
    Posts
    8,895
    By the way, the standard library provides the algorithms to do this.

    Code:
    // sort so that duplicate values become adjacent, then apply the erase-unique idiom
    std::vector<int> with_dupes = data();
    std::sort(with_dupes.begin(), with_dupes.end());
    // std::unique moves the unique elements to the front and returns an iterator
    // past them; erase() then removes the leftover tail
    with_dupes.erase(std::unique(with_dupes.begin(), with_dupes.end()), with_dupes.end());
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  9. #9
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    Doesn't std::set do this for you?

  10. #10
    nik (Registered User)
    Join Date
    Nov 2010
    Posts
    44
    Yes, but I wanted to compare different kinds of approaches to this problem and figure out which one has the best time complexity.

  11. #11
    CornedBee (Cat without Hat)
    Join Date
    Apr 2003
    Posts
    8,895
    If you're in a bounded integer space (i.e., you know that the elements of your vector are below a reasonably small N), the fastest way of removing dupes is to create a bitset of size N and use it to mark numbers that have already occurred.
    Reasonable N is of course bounded by the amount of memory you're willing to spare. A megabyte allows you to have numbers up to 2^23.
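
    A rough sketch, using std::vector<bool> as the bitset and assuming the values are non-negative and below N:

    Code:
    #include <cstddef>
    #include <vector>

    // bounded integer space: one bit per possible value
    std::vector<int> dedup_bitset(const std::vector<int>& in, std::size_t N)
    {
        std::vector<bool> seen(N, false);  // bit-packed: N bits, not N bytes
        std::vector<int> out;
        for (std::size_t i = 0; i < in.size(); ++i)
            if (!seen[in[i]])              // not marked yet
            {
                seen[in[i]] = true;        // mark as already occurred
                out.push_back(in[i]);
            }
        return out;
    }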
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law
