Thread: need help to sort and remove duplicate entries

  1. #1
    Registered User
    Join Date
    Nov 2008
    Posts
    222

    need help to sort and remove duplicate entries

    I have an unsorted set of data stored in a CSV file. I need to use the timestamp to make sure I end up with the right set of data, and I need to eliminate duplicate entries. The file is large: it holds more than 10,000 records in the format below.

    Any idea which method to use here for better performance?

    order number activity timestamp
    99011267890987600123548; register order;2014-03-26 10:32:56.000
    990112678909875001235486; register order;2014-03-26 10:25:56.000
    990112678909876300123546;register order;2014-03-22 10:31:56.000
    990112678909873001235436;register order;2014-01-26 11:42:56.000
    990112678909872001235426;register order;2014-02-26 10:32:46.000
    990112678909872002235216;check stock;2014-03-26 10:32:46.000
    990112678909872002235216;ship order;2014-01-26 11:22:16.000
    990112678909872002235216;handle payment;2014-03-26 10:32:46.000
    Last edited by leo2008; 03-15-2021 at 05:10 AM.

  2. #2
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Perhaps use a better tool than C++?

    Code:
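    #!/usr/bin/env python
    import sys
    import re
    
    # Capture everything before the first ';' (the order number).
    m1 = re.compile(r'^(.*?);')
    seen = {}
    for line in sys.stdin:
      l = line.rstrip('\n')
      m = m1.match(l)
      if m:
        # Print a record only the first time its order number appears.
        if m.group(1) not in seen:
          seen[m.group(1)] = 1
          print(l)
      else:
        # No ';' on this line (e.g. the header): pass it through unchanged.
        print(l)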
    E.g.
    Code:
    $ ./foo.py < foo.txt
    order number activity timestamp
    99011267890987600123548; register order;2014-03-26 10:32:56.000
    990112678909875001235486; register order;2014-03-26 10:25:56.000
    990112678909876300123546;register order;2014-03-22 10:31:56.000
    990112678909873001235436;register order;2014-01-26 11:42:56.000
    990112678909872001235426;register order;2014-02-26 10:32:46.000
    990112678909872002235216;check stock;2014-03-26 10:32:46.000
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Nov 2008
    Posts
    222
    Is it possible to use C++ or Java? Any sample references would help here.

  4. #4
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > m1 = re.compile(r'^(.*?);')
    Use this -> std::basic_regex - cppreference.com

    > seen = {}
    Use this -> std::map - cppreference.com
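
    A minimal sketch of how those two pieces might fit together, assuming the same semicolon-separated input as in post #1. The function name `dedupe_by_first_field` is made up for illustration. It mirrors the Python script from post #2: the first line seen for each order number is kept, and lines without a `;` are passed through unchanged.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <regex>
#include <string>

// Hypothetical helper: copies lines from `in` to `out`, dropping any line
// whose first field (the text before the first ';') was already seen.
void dedupe_by_first_field(std::istream& in, std::ostream& out) {
    // Same pattern as the Python version: lazily capture up to the first ';'.
    static const std::regex key_re("^(.*?);");
    std::map<std::string, int> seen;

    std::string line;
    while (std::getline(in, line)) {
        std::smatch m;
        if (std::regex_search(line, m, key_re)) {
            // emplace().second is true only when the key was newly inserted.
            if (seen.emplace(m[1].str(), 1).second) {
                out << line << '\n';
            }
        } else {
            // No ';' on this line (for example a header): pass it through.
            out << line << '\n';
        }
    }
}
```

    Calling `dedupe_by_first_field(std::cin, std::cout)` from `main` would give a filter that can be used like `./foo < foo.txt`. A `std::set<std::string>` would work just as well as the map, since only the keys matter here.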

  5. #5
    Registered User
    Join Date
    Nov 2008
    Posts
    222
    As far as I know, there is no reader/writer for CSV files in C++. Do we need to use a library such as Boost here?
    How can I sort the data stored in the CSV file? I need to eliminate duplicate entries from this huge CSV file. Is there an algorithm for this?

  6. #6
    Registered User
    Join Date
    Feb 2021
    Posts
    6
    Some garbage code (without error checking, for simplicity):
    Code:
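    #include <cstdio>
    #include <cstring>
    #include <algorithm>
    #include <string>
    #include <vector>
    #include <set>
    
    int main() {
        FILE* in = fopen("input", "r");
        FILE* out = fopen("output", "w");
        
        std::set<std::string> lookup;    // timestamps seen so far
        std::vector<std::string> lines;  // first line seen for each timestamp
        
        char buffer[256];
        while (fgets(buffer, sizeof buffer, in) != NULL) {
            // The timestamp is everything from the last ';' to the end of the line.
            const char* timestamp = strrchr(buffer, ';');
            if (timestamp == NULL) {
                // No ';' (e.g. the header line): building a std::string from
                // NULL is undefined behaviour, so copy the line through as-is.
                fputs(buffer, out);
                continue;
            }
            // insert().second is true only if the timestamp was not already present.
            if (lookup.insert(std::string(timestamp)).second) {
                lines.push_back(buffer);
            }
        }
        
        // Sort the kept lines (lexicographically, by whole line) and output them.
        std::sort(lines.begin(), lines.end());
        for (const auto& line : lines) {
            fputs(line.c_str(), out);
        }
    
        fclose(in);
        fclose(out);
        return 0;
    }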
    The idea is: keep a set of unique timestamps, add lines without a duplicate timestamp to a vector for sorting. I make many assumptions; importantly, the input format shall be consistent, and the final character in the input file shall be a newline. You should adapt this code to fit your requirements.

  7. #7
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > As far as I know, there is no reader/writer for CSV files in C++
    Why do you need this?

    > leo2008, Join Date Nov 2008
    Why are you still asking?

    Perhaps because you never stick around long enough to consolidate what you learnt.

    Showing up every couple of years to ask a couple of random questions isn't going to get it done.
    8 years of posts here, and for what?

    TBH, you're just a help vampire with a very slow burning fuse.
