Thread: delete all lines that have more than one occurrence on a file

  1. #1
    Registered User
    Join Date
    Feb 2010
    Posts
    72

    delete all lines that have more than one occurrence on a file

    Hi all,

    I have an input CSV file whose lines look like this (in.csv):

    Code:
    0123456789;CUST098WZAX;35
    0123450123;CUST056TVZE;37
    0123458989;CUST034XMKS;67
    0123456789;CUST098WZAX;35
    0123458989;CUST034XMKS;67
    I want to remove every line that appears more than once in the file, so this file should produce the following (out.csv):

    Code:
    0123450123;CUST056TVZE;37
    I tried the code below, but it still writes the first occurrence of each duplicated line, and I'm having trouble removing that one too. Your help is welcome.

    Here is my code:

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    struct somehash {
        struct somehash *next;
        unsigned hash;
        char *mem;
    };
    
    #define THE_SIZE 100000
    
    struct somehash *table[THE_SIZE] = { NULL,};
    
    struct somehash **some_find(char *str, unsigned len);
    static unsigned some_hash(char *str, unsigned len);
    
    int main (void)
    {
        char buffer[100];
        struct somehash **pp;
        size_t len;
        FILE * pFileIn;
        FILE * pFileOut;
    
        pFileIn  = fopen("in.csv", "r");
        pFileOut  = fopen("out.csv", "w+");
    
        if (pFileIn==NULL) perror ("Error opening input file");
        if (pFileOut==NULL) perror ("Error opening output file");
    
        while (fgets(buffer, sizeof buffer, pFileIn)) {
            len = strlen(buffer);
            pp = some_find(buffer, len);
            if (*pp) { /* found */
                fprintf(stderr, "Duplicate:%s\n", buffer);
            }
            else { /* not found: create one */
                fprintf(stdout, "%s", buffer);
                fprintf(pFileOut, "%s", buffer);
                *pp = malloc(sizeof **pp);
                (*pp)->next = NULL;
                (*pp)->hash = some_hash(buffer,len);
                (*pp)->mem = malloc(1+len);
                memcpy((*pp)->mem , buffer,  1+len);
            }
        }
    
        return 0;
    }
    
    struct somehash **some_find(char *str, unsigned len)
    {
        unsigned hash;
        unsigned short slot;
        struct somehash **hnd;
    
        hash = some_hash(str,len);
        slot = hash % THE_SIZE;
        for (hnd = &table[slot]; *hnd ; hnd = &(*hnd)->next ) {
            if ( (*hnd)->hash != hash) continue;
            if ( strcmp((*hnd)->mem , str) ) continue;
            break;
        }
    
        return hnd;
    }
    
    static unsigned some_hash(char *str, unsigned len)
    {
        unsigned val;
        unsigned idx;
    
        if (!len) len = strlen(str);
    
        val = 0;
        for(idx=0; idx < len; idx++ ) {
            val ^= (val >> 2) ^ (val << 5) ^ (val << 13) ^ str[idx] ^ 0x80001801;
        }
    
        return val;
    }
    Regards.

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Are you open to using different tools for the job?
    Code:
    $ cat wibble.txt
    0123456789;CUST098WZAX;35
    0123450123;CUST056TVZE;37
    0123458989;CUST034XMKS;67
    0123456789;CUST098WZAX;35
    0123458989;CUST034XMKS;67
    $ perl -e '%h=();while(<>){chomp;$h{$_}++;} foreach $i ( keys %h ) { print "$i\n" if ( $h{$i} == 1 ); }' wibble.txt
    0123450123;CUST056TVZE;37
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Feb 2010
    Posts
    72
    Quote Originally Posted by Salem View Post
    Are you open to using different tools for the job?
    Code:
    $ cat wibble.txt
    0123456789;CUST098WZAX;35
    0123450123;CUST056TVZE;37
    0123458989;CUST034XMKS;67
    0123456789;CUST098WZAX;35
    0123458989;CUST034XMKS;67
    $ perl -e '%h=();while(<>){chomp;$h{$_}++;} foreach $i ( keys %h ) { print "$i\n" if ( $h{$i} == 1 ); }' wibble.txt
    0123450123;CUST056TVZE;37
    Thanks Salem, but I want to do it in C.

  4. #4
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    WIBBLE!

    That is all.

    Soma

  5. #5
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    Or
    Code:
    $ cat wibble.txt
    0123456789;CUST098WZAX;35
    0123450123;CUST056TVZE;37
    0123458989;CUST034XMKS;67
    0123456789;CUST098WZAX;35
    0123458989;CUST034XMKS;67
    $ sort wibble.txt | uniq -u
    0123450123;CUST056TVZE;37
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  6. #6
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    To answer your question about doing it in C, however, Salem has shown the way.
    Just add a count member to your struct. Set it to one when you add a node. Increment it when the same value is seen again.
    Then loop through the table and print those members that have a count of 1.

    Additionally,

    What's with the "some" prefix to your names? What does it mean?
    Calling perror does not exit, but you certainly don't want to continue if the files don't open.
    Do you need "w+" (write AND read) access or just "w" for "out.csv"?
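
    For anyone who wants to see the count idea spelled out, here is a minimal sketch, not the original poster's final code. The names write_unique_lines, hash_line, struct node and TABLE_SIZE are made up for illustration, the hash is a simple djb2-style function swapped in for some_hash, and it assumes every line fits in a 256-byte buffer. Lines come out in table order, not input order.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024u            /* hypothetical bucket count */

struct node {                       /* hypothetical node type */
    struct node *next;              /* next node in the same bucket */
    unsigned count;                 /* how many times the line was seen */
    char *line;                     /* copy of the line, without newline */
};

/* djb2-style string hash (an assumption, not the thread's some_hash). */
static unsigned hash_line(const char *s)
{
    unsigned h = 5381u;
    while (*s)
        h = h * 33u + (unsigned char)*s++;
    return h;
}

/* Reads every line from `in`, then writes to `out` only the lines that
   occurred exactly once. Returns the number of lines written, or -1 if
   an allocation failed. */
long write_unique_lines(FILE *in, FILE *out)
{
    struct node *table[TABLE_SIZE] = { NULL };
    char buffer[256];               /* assumes lines are shorter than this */
    long written = 0;
    int failed = 0;
    unsigned i;

    assert(in != NULL && out != NULL);

    /* Pass 1: count each distinct line. */
    while (!failed && fgets(buffer, sizeof buffer, in)) {
        struct node *n;
        unsigned slot;

        buffer[strcspn(buffer, "\r\n")] = '\0';     /* drop the line ending */
        slot = hash_line(buffer) % TABLE_SIZE;

        for (n = table[slot]; n != NULL; n = n->next)
            if (strcmp(n->line, buffer) == 0)
                break;

        if (n != NULL) {
            n->count++;                             /* seen before */
        } else {
            n = malloc(sizeof *n);
            if (n != NULL)
                n->line = malloc(strlen(buffer) + 1);
            if (n == NULL || n->line == NULL) {
                free(n);
                failed = 1;
            } else {
                strcpy(n->line, buffer);
                n->count = 1;                       /* first sighting */
                n->next = table[slot];
                table[slot] = n;
            }
        }
    }

    /* Pass 2: walk the table, print the count == 1 entries, free all. */
    for (i = 0; i < TABLE_SIZE; i++) {
        struct node *n = table[i];
        while (n != NULL) {
            struct node *next = n->next;
            if (!failed && n->count == 1) {
                fprintf(out, "%s\n", n->line);
                written++;
            }
            free(n->line);
            free(n);
            n = next;
        }
    }
    return failed ? -1 : written;
}
```

    A caller would open "in.csv" with "r" and "out.csv" with "w", return EXIT_FAILURE right after perror if either fopen fails, and then call write_unique_lines(in, out). Nothing can be written until the whole input has been read, because a line is only known to be unique once the last line has been counted.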

  7. #7
    Registered User
    Join Date
    Feb 2010
    Posts
    72
    Quote Originally Posted by phantomotap View Post
    WIBBLE!

    That is all.

    Soma
    Thanks phantomotap for the joke.

    Regards.

  8. #8
    Registered User
    Join Date
    Feb 2010
    Posts
    72
    Quote Originally Posted by oogabooga View Post
    Or
    Code:
    $ cat wibble.txt
    0123456789;CUST098WZAX;35
    0123450123;CUST056TVZE;37
    0123458989;CUST034XMKS;67
    0123456789;CUST098WZAX;35
    0123458989;CUST034XMKS;67
    $ sort wibble.txt | uniq -u
    0123450123;CUST056TVZE;37
    Thanks, but as I said before, I want to do this in C (in the shell it's very simple, as you posted).

  9. #9
    Registered User
    Join Date
    Feb 2010
    Posts
    72
    Quote Originally Posted by oogabooga View Post
    To answer your question about doing it in C, however, Salem has shown the way.
    Just add a count member to your struct. Set it to one when you add a node. Increment it when the same value is seen again.
    Then loop through the table and print those members that have a count of 1.

    Additionally,

    What's with the "some" prefix to your names? What does it mean?
    Calling perror does not exit, but you certainly don't want to continue if the files don't open.
    Do you need "w+" (write AND read) access or just "w" for "out.csv"?
    You're right, it's a typo; "w" is fine. Thanks for the tips. The problem is now solved.

    Regards.
