Entire array is not printing--why?

**whiteflags** · 06-19-2015

To be honest I am not sure that this task is very good for a C novice. A great volume of data can require skill to manipulate in a quick fashion, and other languages have tools that will help you implement this in a painless way (like python sets).

Continuing on with C, undaunted, I would sort the lines of the file first. That way duplicates of serial numbers will clump together and you can easily find unique ones or duplicates. Plus, with a sorted file, I would think that any program doing this would be equally performant.

An easy way to sort the file is to use an existing program. I saved your file as old.txt and did this:

Code:

 C:\Users\Josh2>sort E:/old.txt /O E:/sorted.txt

You can do what you want.

Then, with sorted.txt:
1) Read the first 5 numbers as a serial number.
2) Compare it with the next serial number.
3) If the serial numbers match, then the second serial in 2) is a duplicate, and you should count it as such.
4) If the serial numbers DON'T match, then the second serial in 2) is a new serial and you have found all the duplicates of the serial in 1).
5) Output as necessary.
6) Repeat until you are out of serial numbers.

This is faster due to the input being sorted. As you go through this process, you can output only the serial numbers that appear at least N times, or output uniques - whatever you need.

**christop** · 06-19-2015

Originally Posted by whiteflags

To be honest I am not sure that this task is very good for a C novice. A great volume of data can require skill to manipulate in a quick fashion, and other languages have tools that will help you implement this in a painless way (like python sets).

I very much agree with that. I used a single line of the Bourne shell language to sort the file and find lines duplicated 5 times:

Code:

<file.txt sort | uniq -c | grep '^ *5 '

But that gave no results because apparently the sample file doesn't have any duplicates.

**algorism** · 06-19-2015

This can be done fairly easily in C but it seems more like a scripting or shell-like thing to do. But if you want to do it in C:

Make a struct containing five ints (or unsigned chars to save space) and another int to hold the count of duplicates for that record.

Create an array of these records. Define this array globally so that it is not on the stack. It will therefore have more space by default and will also be automatically initialized to zeroes. (Alternatively we could use a pointer in main and calloc/free the storage.)

Set number_of_records to 0.

Read the file line by line into five variables (call them a,b,c,d,e).

Go through the array so far to see if the current line is a duplicate. (Here I'm assuming a linear search is good enough.)

If it is a duplicate, increment the count for the matching array element. If not, add it to the end of the array and increment the number_of_records.

When you're finished going through the file just go through the array and print out the records with the duplicate count you want.

**Charles Driver** · 06-19-2015

Originally Posted by whiteflags

To be honest I am not sure that this task is very good for a C novice. A great volume of data can require skill to manipulate in a quick fashion, and other languages have tools that will help you implement this in a painless way (like python sets).

Continuing on with C, undaunted, I would sort the lines of the file first. That way duplicates of serial numbers will clump together and you can easily find unique ones or duplicates. Plus, with a sorted file, I would think that any program doing this would be equally performant.

An easy way to sort the file is to use an existing program. I saved your file as old.txt and did this:

Code:

 C:\Users\Josh2>sort E:/old.txt /O E:/sorted.txt

You can do what you want.

Then, with sorted.txt:
1) Read the first 5 numbers as a serial number.
2) Compare it with the next serial number.
3) If the serial numbers match, then the second serial in 2) is a duplicate, and you should count it as such.
4) If the serial numbers DON'T match, then the second serial in 2) is a new serial and you have found all the duplicates of the serial in 1).
5) Output as necessary.
6) Repeat until you are out of serial numbers.

This is faster due to the input being sorted. As you go through this process, you can output only the serial numbers that appear at least N times, or output uniques - whatever you need.

The above is great info. I did not think C was good for this but I did not know...I am purely php so, any pointers on what language/shell scripting I should be looking at is highly appreciated. I am going to drop this task in C then. You suggest python sets?

**Charles Driver** · 06-19-2015

Originally Posted by algorism

This can be done fairly easily in C but it seems more like a scripting or shell-like thing to do. But if you want to do it in C:

Make a struct containing five ints (or unsigned chars to save space) and another int to hold the count of duplicates for that record.

Create an array of these records. Define this array globally so that it is not on the stack. It will therefore have more space by default and will also be automatically initialized to zeroes. (Alternatively we could use a pointer in main and calloc/free the storage.)

Set number_of_records to 0.

Read the file line by line into five variables (call them a,b,c,d,e).

Go through the array so far to see if the current line is a duplicate. (Here I'm assuming a linear search is good enough.)

If it is a duplicate, increment the count for the matching array element. If not, add it to the end of the array and increment the number_of_records.

When you're finished going through the file just go through the array and print out the records with the duplicate count you want.

Thank you Algorism*, I agree with you and the other posters ,maybe not good to do in C, do you have a suggestion for what you think is best for something like this? I need to do this Many Many times, so speed is an issue and it is just taking too long in PHP given the millions of numbers. Speedy if its under 100000 but SLOWWWW after that.

**Charles Driver** · 06-19-2015

Wonderful suggestion

, I am going to try this my self and post how it worked out.

**Charles Driver** · 06-19-2015

Originally Posted by christop

I very much agree with that. I used a single line of the Bourne shell language to sort the file and find lines duplicated 5 times:

Code:

<file.txt sort | uniq -c | grep '^ *5 '

But that gave no results because apparently the sample file doesn't have any duplicates.

So thank you, this definately worked! I have it written in PHP but it is slow because of the amount of numbers being processed, the shell script is a bit slower than the PHP, that was why I chose C, in hopes it would be faster. Do you know of any other language that would be better than C but faster than the php and bourne shell given the large set of numbers? I have to do this many times and the timing adds up to hours.

**migf1** · 06-19-2015

Originally Posted by Charles Driver

So thank you, this definately worked! I have it written in PHP but it is slow because of the amount of numbers being processed, the shell script is a bit slower than the PHP, that was why I chose C, in hopes it would be faster. Do you know of any other language that would be better than C but faster than the php and bourne shell given the large set of numbers? I have to do this many times and the timing adds up to hours.

What has been described so far (or better, what I've understood) is fairly easy to do with C. Since it's not a homework, I'm posting 300 lines of mostly un-commented code that I think does what you asked.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* for reporting internal errors */
#define DBG_ERRMSG( msg )                                      \
do {                                                           \
    fprintf(stderr, "\n*** ERROR [ %s | %s():ln %d ]\n",   \
        __FILE__, __func__,  __LINE__                  \
        );                                             \
    fprintf( stderr, "*** %s\n", (msg) );                  \
    putchar( '\n' );                                       \
} while(0)

typedef struct ArrStr {
    char **strings;
    size_t len;
    size_t capacity;
} ArrStr;

/***********************************************//**
 * Create dynamically and return the duplicate of a cstring.
 * The caller is responsible for freeing up the returned duplicate.
 ***************************************************
 */
char *s_dup( const char *src )
{
    size_t sz = 0;
    char *ret = NULL;

    if ( NULL == src ) {
        DBG_ERRMSG( "NULL pointer argument" );
        return NULL;
    }

    sz = 1 + strlen( src );
    if ( NULL != (ret = malloc(sz)) ) {
        memcpy( ret, src, sz );
    }

    return ret;
}

/***********************************************//**
 *
 ***************************************************
 */
int cb_compare_strings( const void *sp1, const void *sp2 )
{
    const char *s1 = *(const char **) sp1;
    const char *s2 = *(const char **) sp2;

    if ( NULL == s1 || NULL == s2 ) {
        return 0;
    }
    return strcmp( s1, s2 );
}

/***********************************************//**
 *
 ***************************************************
 */
void arrstr_free( ArrStr **arrstr )
{
    size_t i;

    if ( NULL == arrstr ) {
        DBG_ERRMSG( "NULL pointer argument" );
        return;
    }
    if ( NULL == *arrstr ) {
        return;
    }
    if ( NULL == (*arrstr)->strings ) {
        goto ret;
    }

    for (i=0; i < (*arrstr)->len; i++) {
        free( (*arrstr)->strings[i] );
    }

ret:
    free( *arrstr );
    *arrstr = NULL;
}

/***********************************************//**
 *
 ***************************************************
 */
ArrStr *arrstr_new( size_t nahead )
{
    ArrStr *arrstr = calloc( 1, sizeof(*arrstr) );
    if ( NULL == arrstr ) {
        DBG_ERRMSG( "arrstr = calloc() failed\n" );
        return NULL;
    }

    if ( nahead < 2 ) {
        nahead = 2;
    }

    arrstr->strings = calloc( nahead, sizeof( *(arrstr->strings) ) );
    if ( NULL == arrstr->strings ) {
        DBG_ERRMSG( "arrstr->strings = calloc() failed\n" );
        free( arrstr );
        return NULL;
    }

    arrstr->len = 0;
    arrstr->capacity = nahead;
    return arrstr;
}

/***********************************************//**
 *
 ***************************************************
 */
ArrStr *arrstr_new_from_file( size_t nahead, const char *fname )
{
    ArrStr *arrstr = NULL;
    char   **temp  = NULL;
    FILE *fp = fopen( fname, "r" );
    char input[BUFSIZ] = {'\0'};
    char errmsg[BUFSIZ] = {'\0'};
    size_t i = 0, cap = 0;          /* i for length, cap for capacity */

    if ( NULL == fp ) {
        DBG_ERRMSG( "fopen() failed\n" );
        goto ret_failure;
    }

    arrstr = calloc( 1, sizeof(*arrstr) );
    if ( NULL == arrstr ) {
        DBG_ERRMSG( "arrstr = calloc() failed\n" );
        goto ret_failure;
    }

    cap = ( nahead < 2 ) ? 2 : nahead;

    arrstr->strings = calloc( cap, sizeof( *(arrstr->strings) ) );
    if ( NULL == arrstr->strings ) {
        DBG_ERRMSG( "arrst->strings = calloc() failed\n" );
        goto ret_failure;
    }

    for (i=0; fgets(input, BUFSIZ, fp); i++) {
        arrstr->len = i;
        if ( i == cap ) {
            cap += cap;
            temp = realloc( arrstr->strings, cap * sizeof(*temp) );
            if ( NULL == temp ) {
                DBG_ERRMSG( "realloc() failed" );
                goto ret_failure;
            }
            arrstr->strings = temp;
        }
        arrstr->strings[i] = s_dup( input );
        if ( NULL == arrstr->strings[i] ) {
            snprintf(
                errmsg,
                BUFSIZ,
                "s_dup() failed on %lu'th string\n",
                (unsigned long) i
                );
            DBG_ERRMSG( errmsg );
            goto ret_failure;
        }
    }
    arrstr->len = i;
    arrstr->capacity = cap;

    fclose( fp );

    return arrstr;

ret_failure:
    fclose( fp );
    arrstr_free( &arrstr );
    return NULL;
}

/***********************************************//**
 *
 ***************************************************
 */
int arrstr_append_string( ArrStr *arrstr, const char *string )
{
    char **temp = NULL;
    size_t cap = 0, len = 0;

    if ( NULL == arrstr || NULL == string ) {
        DBG_ERRMSG( "NULL pointer argument" );
        return 0;
    }

    cap = arrstr->capacity;
    if ( arrstr->len == arrstr->capacity ) {
        cap += cap;
        temp = realloc( arrstr->strings, cap * sizeof(*temp) );
        if ( NULL == temp ) {
            DBG_ERRMSG( "realloc( arrstr->strings ) failed" );
            return 0;
        }
        arrstr->strings = temp;
        arrstr->capacity = cap;
    }

    len = arrstr->len;
    arrstr->strings[len] = s_dup( string );
    if ( NULL == arrstr->strings[len] ) {
        DBG_ERRMSG( "s_dup() failed" );
        return 0;
    }
    arrstr->len++;

    return 1;
}

/***********************************************//**
 *
 ***************************************************
 */
void arrstr_dump( const ArrStr *arrstr )
{
    size_t i;

    if ( NULL == arrstr ) {
        DBG_ERRMSG( "NULL pointer argument" );
        return;
    }

    printf( "[len: %zu]\n", arrstr->len );
    printf( "[cap: %zu]\n", arrstr->capacity );
    printf( "[strings @ 0x%X]\n", arrstr->strings );
    if ( NULL != arrstr->strings ) {
        for (i=0; i < arrstr->len; i++) {
            printf( "%s", arrstr->strings[i] );
        }
    }
    putchar( '\n' );
}

/***********************************************//**
 *
 ***************************************************
 */
int arrstr_sort( ArrStr *arrstr )
{
    if ( NULL == arrstr ) {
        DBG_ERRMSG( "NULL pointer argument" );
        return 0;
    }
    if ( arrstr->len < 2 ) {
        DBG_ERRMSG( "cannot sort less than 2 strings" );
        return 0;
    }

    qsort(
        arrstr->strings,
        arrstr->len,
        sizeof( *(arrstr->strings) ),
        cb_compare_strings
        );

    return 1;
}


/***********************************************//**
 *
 ***************************************************
 */
int main( void )
{
    const size_t NTIMES = 5;                /* desired count of duplicates */
    size_t i,j;
    ArrStr *dups = NULL;                     /* to keep desired duplicates */

    /*
     * Read file-lines into arr as cstrings.
     */

    ArrStr *arr = arrstr_new_from_file(
                32,
                "charles_file_to_array.txt"
                );
    if ( NULL == arr ) {
        DBG_ERRMSG( "arrstr_new_from_file() failed" );
        return EXIT_FAILURE;
    }

    printf( "[lines read: %zu]\n", arr->len );
//    arrstr_dump( arr );

//    printf( "[sorted]\n" );
    arrstr_sort( arr );
//    arrstr_dump( arr );

    /*
     * Store in dups all duplicates that occur exactly NTIMES times.
     */

    dups = arrstr_new( 32 );
    if ( NULL == dups ) {
        DBG_ERRMSG( "arrstr_new() fialed" );
        goto exit_failure;
    }

    for (i=0; i < arr->len-1; i++) {
        for (
        j=0;
        j < arr->len-1 && 0 == strcmp(arr->strings[i], arr->strings[i+j]);
        j++
        ){
            ;
        }
        if ( NTIMES == j ) {
            arrstr_append_string( dups, arr->strings[i] );
        }
        i += (j-1);
    }

    printf( "[lines found exactly %d times: %zu]\n", NTIMES, dups->len );
//    arrstr_dump( dups );

    /* cleanup & exit */
    arrstr_free( &arr );
    arrstr_free( &dups );
    return EXIT_SUCCESS;

exit_failure:
    arrstr_free( &arr );
    arrstr_free( &dups );
    return EXIT_FAILURE;

}

The ArrStr mini-interface is pretty generic, so it may be used elsewhere too (and be enhanecd of course). However, it was written in a hurry so it may contains bugs.

That being said, speed does not depend on the language only. It also depeneds on the algorthms and the data-strutures used.

In C you must code most of the data-structures you need manually, along with their interface. Unless of course you use one or more 3rd-part libraries, which is frankly what most people do anyway. Other languages provide such libraries as part of their eco-system.

Anyway, what you asked was pretty simple, that's why I posted the code. I hope it helps you, but note that speed-wise it is pretty naive.

**Nominal Animal** · 06-19-2015

If the order of the results is not important,

Code:

mawk '{ count[$1 " " $2 " " $3 " " $4 " " $5]++ } END { for (key in count) if (count[key] == 5) print key; }' file.txt

is very, very fast. The above also works using other awk variants (gawk, any generic awk), but in my experience, mawk is the fastest one if multiple implementations are available.

If the results need to be in sorted order, then pipe the above through sort -n -k 1 -k 2 -k 3 -k 4 -k 5 or use GNU awk -compatible awk and

Code:

gawk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc";
              for (i=1; i<=35; i++) k[i] = sprintf("%3d", i); }
            { count[k[int($1)] k[int($2)] k[int($3)] k[int($4)] k[int($5)]]++ }
      END   { for (key in count) if (count[key] == 5) printf "%s\n", key }' file.txt

In general, this sounds like a combinatorial problem.

If speed was of the essence, I'd use POSIX low-level I/O (mmap(2) if the data is in a file, read(2) otherwise), and a custom character-by-character parser to parse each part. The input part, scanning and parsing the input, is in my experience the bottleneck here.

Since 36⁵ = 60,466,176, you only need 26 bits to describe each five-part number, exactly. In fact, you can pack five 6-bit digits (0-63, inclusive) in a 30-bit integer. Therefore, I'd use a 32-bit integer type (uint32_t from stdint.h, most likely) to represent each five-part number. These are fast and easy to manipulate and to e.g. sort.

The basic way to find the number of occurrences is to just sort the numbers. Any duplicates will be consecutive, and thus trivial to count. (A single linear pass over the array suffices.)

One interesting option would be to eschew the sorting part altogether, and simply use an array of 36⁵ (or 35⁵, if 0 is not a valid part) unsigned chars to count the number each possible combination is seen. (This assumes that only small counts are interesting; if a combination may occur more times than can be described in an unsigned char, overflow should be checked against -- for example, using count[number] += count[number] < UCHAR_MAX; to increment the count. With GCC-4.8 and later, if (__builtin_expect(!++(count[number]), 0)) --(count[number]); compiles to nice fast x86-64 code.)

That option has the benefit of exact memory use (60,466,176 or 52,521,875 bytes, respectively) and avoiding any need for sorting -- as this would essentially be a bucket sort itself. If the cache behaviour ends up being acceptable on the used architecture, this would likely be the fastest way to implement this.

**algorism** · 06-19-2015

Here's a quick-and-dirty hash table implementation using nominal's 6-bit coding scheme and a custom scanner. (It should probably be coupled with mmap, but I've just used fgets, reading line by line.)

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <inttypes.h>

// set this to about 2000000 (~30MB)
#define TABLE_SIZE    100000
#define BASEMENT_SIZE (TABLE_SIZE/10)    // 10 percent (?)

typedef uint32_t uint;

typedef struct Record {
    uint n, cnt, next;
} Record;

Record table[TABLE_SIZE]; // Relying on zeroing of global memory.
uint g_nextElement = TABLE_SIZE - BASEMENT_SIZE;

int addToTable(uint n) {
    uint h = n % (TABLE_SIZE - BASEMENT_SIZE);

    if (table[h].cnt == 0) { // unused slot in main table
        table[h].n = n;
        table[h].cnt = 1;
        return 1;
    }

    while (1) {
        if (table[h].n == n) {  // already in table
            ++table[h].cnt;
            return 0;
        }
        if (table[h].next == 0)
            break;
        h = table[h].next;
    }

    if (g_nextElement >= TABLE_SIZE) {
        fprintf(stderr, "Error: hash table overflow\n");
        exit(EXIT_FAILURE);
    }
    h = table[h].next = g_nextElement++;
    table[h].n = n;
    table[h].cnt = 1;
    return 1;
}

void printWithCnt(uint n) {
    for (uint i=0; i<TABLE_SIZE; ++i)
        if (table[i].cnt == n) {
            uint n = table[i].n;
            printf("%u %u %u %u %u\n",
                (n >> 24) & 0x3f,
                (n >> 18) & 0x3f,
                (n >> 12) & 0x3f,
                (n >>  6) & 0x3f,
                 n        & 0x3f);
        }
}

uint scan_line(char *s) {
    uint ret = 0, n = 0;
    for ( ; *s; ++s) {
        if (*s>='0' && *s<='9')
            n = 10*n + (*s - '0');
        else {
            ret = ret << 6 | n;
            n = 0;
        }
    }
    return ret;
}

int main() {
    char line[BUFSIZ];
    uint szFile = 0, szTable = 0;

    FILE *fp = fopen("serials.txt", "r");
    if (!fp) {
        fprintf(stderr, "Error: can't open file\n");
        exit(EXIT_FAILURE);
    }

    while (fgets(line, sizeof line, fp)) {
        if (addToTable(scan_line(line)))
            ++szTable;
        ++szFile;
    }
    fclose(fp);

    printf("szTable: %u   szFile: %u\n", (unsigned)szTable, (unsigned)szFile);

    printWithCnt(5);

    return 0;
}

If you don't have uint32_t (from c99) you can just use unsigned int.
The hash table's pretty simplistic.
I didn't even use pointers but stuck with indices.
I'm not sure if that makes it slower, but at least uint32_t indices are only 32 bits even on a 64 bit machine, if that matters.

EDIT: Tried to make the code a little less wonky.

**algorism** · 06-20-2015

Here's an attempt at nominal's idea of using mmap and a giant unsigned char array of counters.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/*
Each input line contains 5 strictly-increasing numbers from 1 to 35, inclusive.
Subtracting 1 makes them from 0 to 34 (i.e., base 35).
Highest input number is:
  31 32 33 34 35
Subtracting 1:
  30 31 32 33 34
Decimal value:
  30*35^4 + 31*35^3 + 32*35^2 + 33*35 + 34
  (((30*35 + 31)*35 + 32)*35 + 33)*35 + 34
  == 46388264
*/

#define SIZE 46388264 // 46 388 264
typedef uint32_t uint;

unsigned char cnt[SIZE];

void err(char *s) {
    fprintf(stderr, "Error: %s\n", s);
    exit(EXIT_FAILURE);
}

void print(uint n) {
    int a = n % 35;    n /= 35;
    int b = n % 35;    n /= 35;
    int c = n % 35;    n /= 35;
    int d = n % 35;    n /= 35;
    printf("%u %u %u %u %u\n", n+1, d+1, c+1, b+1, a+1);
}

int main(int argc, char **argv) {
    int fd, x, i, line=0;
    struct stat sb;
    char *addr, *p;

    if (argc != 3) {
        printf("Usage: prog input_file num_repeats\n");
        return EXIT_FAILURE;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd == -1) err("open");

    if (fstat(fd, &sb) == -1) err("fstat");

    addr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) err("mmap");

    for (p = addr; p < addr + sb.st_size; ++p) {

        uint val = 0, n = 0;
        for ( ; *p != '\n'; ++p) {
            if (*p>='0' && *p<='9')
                n = 10*n + (*p - '0');
            else {
                val = (35 * val) + (n - 1);
                n = 0;
            }
        }
        val = (35 * val) + (n - 1);

        if (cnt[val] == (unsigned char)255U)
            printf("counter %u overflowed\n", val);
        else
            ++cnt[val];

        if (++line % 100000 == 0) printf("%d\n", line);
    }

    x = (unsigned char)atoi(argv[2]);
    for (i=0; i<SIZE; ++i)
        if (cnt[i] == x)
            print(i);

    munmap(addr, sb.st_size);
    close(fd);

    return 0;
}

If you're dealing with a different end-of-line convention (e.g., windows) the *p != '\n' should I think be *p != '\r' and you'll have to skip the following '\n' too.

**Nominal Animal** · 06-20-2015

Algorism, did you compare their runtimes on your machine? I'm too lazy to do that myself, but if you have, I'd be interested in seeing the results.

My own implementation differs a bit. (Oh, and this one supports all newline conventions -- LF, CR, CR LF, and LF CR --, and even counts the lines, to help the user in pinpointing the error locations if there is a problem in the input file.)

Code:

#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>

#define   MIN_PART      1U
#define   MAX_PART      35U
#define   PARTS         5
#define   RADIX        (MAX_PART - MIN_PART + 1)
#define   LIMIT         52521875    /* RADIX ** PARTS */
#define   PART_DIGITS   2

#if defined(MAP_NORESERVE)
#define   ROMAP_FLAGS   MAP_SHARED | MAP_NORESERVE
#else
#define   ROMAP_FLAGS   MAP_SHARED
#endif

#if __GNUC__ >= 4
#define   unlikely(x)   __builtin_expect((x), 0)
#else
#define   unlikely(x)   (x)
#endif

typedef struct {
    const unsigned char *data;
    size_t               size;
    size_t               len;
} romap;

static void map_release(romap *const map)
{
    if (map) {
        if (map->size)
            munmap((void *)map->data, map->size);
        map->data = NULL;
        map->size = 0;
        map->len  = 0;
    }
}

static int map_readonly(const char *const filename, romap *const map)
{
    struct stat info;
    void       *data;
    long        page;
    size_t      size, len;
    int         fd, result;

    if (!map)
        return errno = EINVAL;

    map->data = NULL;
    map->len  = 0;
    map->size = 0;

    if (!filename || !*filename)
        return errno = ENOENT;

    errno = 0;
    page = sysconf(_SC_PAGESIZE);
    if (page < 1L || errno != 0)
        return errno = ENOSYS;

    do {
        fd = open(filename, O_RDONLY | O_NOCTTY);
    } while (fd == -1 && errno == EINTR);
    if (fd == -1)
        return errno;

    do {
        result = fstat(fd, &info);
    } while (result == -1 && errno == EINTR);
    if (result == -1) {
        const int saved_errno = errno;
        close(fd);
        return errno = saved_errno;
    }
    if (info.st_size < (off_t)1) {
        close(fd);
        return errno = ENODATA;
    }

    len = (size_t)info.st_size;
    if ((off_t)len != info.st_size) {
        close(fd);
        return errno = ENOMEM;
    }

    if (len % (size_t)page)
        size = len + (size_t)page - (len % (size_t)page);
    else
        size = len;
    if (size < len) {
        close(fd);
        return errno = ENOMEM;
    }

    do {
        data = mmap(NULL, size, PROT_READ, ROMAP_FLAGS, fd, (off_t)0);
    } while (data == MAP_FAILED && errno == EINTR);
    if (data == MAP_FAILED) {
        const int saved_errno = errno;
        close(fd);
        return errno = saved_errno;
    }

    if (close(fd)) {
        munmap(data, size);
        return errno = EIO;
    }

    map->data = (const unsigned char *)data;
    map->size = size;
    map->len  = len;

    return errno = 0;
}

static int parse_ulong(const char *s, unsigned long *const val)
{
    const char *end = NULL;

    if (!val || !s || !*s)
        return EINVAL;

    errno = 0;
    *val = strtoul(s, (char **)&end, 0);
    if (errno || !end || end <= s)
        return EINVAL;

    while (*end == '\t' || *end == '\n' || *end == '\v' ||
           *end == '\f' || *end == '\r' || *end == ' ')
        end++;

    if (*end)
        return EINVAL;

    return 0;
}


int main(int argc, char *argv[])
{
    unsigned char *count;
    romap          input;
    unsigned long  exact;
    unsigned char  match;
    size_t         i;
 
    if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s FILENAME COUNT\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program reads lines of form\n");
        fprintf(stderr, "       N1 N2 N3 N4 N5\n");
        fprintf(stderr, "where N1..N5 are all positive decimal numbers\n");
        fprintf(stderr, "between 1 and 35, inclusive. Entries that occur\n");
        fprintf(stderr, "exactly COUNT times will be output in ascending order.\n");
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    if (parse_ulong(argv[2], &exact)) {
        fprintf(stderr, "%s: Invalid COUNT.\n", argv[2]);
        return EXIT_FAILURE;
    }
    match = (unsigned char)exact;
    if ((unsigned long)match != exact || !(unsigned char)(match + 1U)) {
        fprintf(stderr, "%s: COUNT is too large.\n", argv[2]);
        return EXIT_FAILURE;
    }
    
    if (map_readonly(argv[1], &input)) {
        fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }

    count = malloc((size_t)LIMIT * sizeof count[0]);
    if (!count) {
        fprintf(stderr, "Not enough memory for the 36^5 buckets.\n");
        return EXIT_FAILURE;
    }
    memset(count, 0, (size_t)LIMIT * sizeof count[0]);

    fprintf(stderr, "Reading %s .. ", argv[1]);
    fflush(stderr);

    {
        unsigned long              lineno = 1UL;
        unsigned long              numbers = 0UL;
        const unsigned char       *p = input.data;
        const unsigned char *const q = input.data + input.len;
        int                        d;

        while (p < q) {

            while (p < q && (*p == '\t' || *p == '\v' ||
                             *p == '\f' || *p == ' '))
                ++p;
            if (p >= q)
                break;

            if (*p == '\n') {
                if (++p < q && *p == '\r')
                    ++p;
                if (p < q) {
                    lineno++;
                    continue;
                } else
                    break;
            } else
            if (*p == '\r') {
                if (++p < q && *p == '\n')
                    ++p;
                if (p < q) {
                    lineno++;
                    continue;
                } else
                    break;
            }

            for (i = 0, d = 0; d < PARTS; d++) {
                unsigned int u = 0U;

                if (p >= q) {
                    fprintf(stderr, "%s: Line %lu: Premature end of line.\n", argv[1], lineno);
                    return EXIT_FAILURE;
                }

                while (p < q && *p >= '0' && *p <= '9')
                    u = 10*u + (unsigned int)(*(p++) - '0');

                if (u < (unsigned int)MIN_PART || u > (unsigned int)MAX_PART) {
                    fprintf(stderr, "%s: Line %lu: Value out of bounds.\n", argv[1], lineno);
                    return EXIT_FAILURE;
                }

                i = (size_t)RADIX * i + (size_t)(u - (unsigned int)MIN_PART);

                while (p < q && (*p == '\t' || *p == '\v' || *p == '\f' || *p == ' '))
                    p++;
            }

            numbers++;

            if (unlikely(!++count[i]))
                --count[i];

            while (p < q && *p != '\n' && *p != '\r')
                ++p;
        }

        fprintf(stderr, "%lu lines, %lu records.\n", lineno, numbers);
        fflush(stderr);
    }

    map_release(&input);

    for (i = 0; i < (size_t)LIMIT; i++)
        if (count[i] == match) {
            char            buffer[8 + (PART_DIGITS + 1) * PARTS];
            char           *p = buffer + sizeof buffer;
            unsigned long   u = (unsigned int)i;
            int             d = PARTS;

            *(--p) = '\0';
            *(--p) = '\n';
            while (d-->0) {
                unsigned long v = (unsigned long)MIN_PART + (u % (unsigned long)RADIX);
                u /= (unsigned long)RADIX;

                do {
                    *(--p) = '0' + (v % 10UL);
                    v /= 10UL;
                } while (v);

                *(--p) = ' ';
            }

            fputs(p + 1, stdout);
        }

    return EXIT_SUCCESS;
}

Originally, I was suspicious of whether the scattered binning would have bad cache behaviour, but it looks like the bin accesses are far enough apart so that the other work (parsing each input line) covers the cacheline access latency just fine.

On this Core i5 4200U laptop I'm using right now, the 21019-line example file linked to in post #13 is processed within a millisecond, if the file is already cached. Process startup overhead is about 35ms real time (30ms user CPU time, 4ms kernel CPU time). Since the algorithm is essentially linear wrt. input size, two million lines should take less than 0.1s real time, if the file is already cached.

Anyone testing the approaches will immediately see that the I/O is the bottleneck here. My version builds even the output lines in a buffer in reverse order, because standard stdio.h I/O is just that slow -- if you are really trying to minimize the total time used. (Floating-point data, encountered often in scientific work, is atrociously slow to parse using standard library interfaces; we're talking about an order of magnitude speed-up in many cases, with identical results.)

That also means that using a fixed one-byte binary output format from that other program -- just 5 bytes per record; no byte-order issues! -- would allow a further speedup, for two reasons: the file size would be significantly smaller, and the parsing would be trivial. For example, one could use ASCII codes 49 '1' to 84 'T' (123456789:;<=>?@ABCDEFGHIJKLMNOPQRST) so that each record would be 5 characters exactly, and the file could be read and written in any programming language (that support reading very long lines).

In that case, most records could also be parsed in sets of four, eight, or sixteen, as a vector, allowing for some very clever tricks for the conversion, speeding up the process even more. The cache effects would need investigating, however; most likely, we'll need to do other work between accessing each counter bin, to allow the CPU time to cache the necessary lines without inducing latencies -- unnecessary waiting. One way to accomplish that is to handle multiple vectors of records at a time, but staggered in time. It gets pretty complicated very fast, but sometimes you can gain a 2x - 4x speedup that way. Here, in normal cases, we're I/O- or cache-bound, so vectorization of the code might not be that useful after all.

It is also quite possible to use more than one thread to do the processing here, for example by splitting the input file into roughly equal-sized parts. If the records were fixed size, that would be even easier. However, the performance benefits might not be that great, because of the I/O- and/or RAM/cache-boundedness. Adding more CPU cores just might not help much. Besides, it is unclear whether one would do better by using atomic accesses to a single array, or whether each thread should have their own bin array.

In my opinion, the problem as posed here is just the butt end of the real problem to be solved. Much better results could be had, if the hidden, untalked of part of the problem could be accessed/modified, too. In particular, by switching to smaller, fixed-size five-byte records (or even six-byte, if you insist on having '\n' after each record).

**Charles Driver** · 06-20-2015

Originally Posted by migf1

What has been described so far (or better, what I've understood) is fairly easy to do with C. Since it's not a homework, I'm posting 300 lines of mostly un-commented code that I think does what you asked.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* for reporting internal errors */
#define DBG_ERRMSG( msg )                                      \
do {                                                           \
    fprintf(stderr, "\n*** ERROR [ %s | %s():ln %d ]\n",   \
        __FILE__, __func__,  __LINE__                  \
        );                                             \
    fprintf( stderr, "*** %s\n", (msg) );                  \
    putchar( '\n' );                                       \
} while(0)

typedef struct ArrStr {
    char **strings;
    size_t len;
    size_t capacity;
} ArrStr;

/***********************************************//**
 * Create dynamically and return the duplicate of a cstring.
 * The caller is responsible for freeing up the returned duplicate.
 ***************************************************
 */
char *s_dup( const char *src )
{
    size_t sz = 0;
    char *ret = NULL;

    if ( NULL == src ) {
        DBG_ERRMSG( "NULL pointer argument" );
        return NULL;
    }

    sz = 1 + strlen( src );
    if ( NULL != (ret = malloc(sz)) ) {
        memcpy( ret, src, sz );
    }

    return ret;
}

/***********************************************//**
 *
 ***************************************************
 */
int cb_compare_strings( const void *sp1, const void *sp2 )
{
    const char *s1 = *(const char **) sp1;
    const char *s2 = *(const char **) sp2;

    if ( NULL == s1 || NULL == s2 ) {
        return 0;
    }
    return strcmp( s1, s2 );
}

/***********************************************//**
 *
 ***************************************************
 */
void arrstr_free( ArrStr **arrstr )
{
    size_t i;

    if ( NULL == arrstr ) {
        DBG_ERRMSG( "NULL pointer argument" );
        return;
    }
    if ( NULL == *arrstr ) {
        return;
    }
    if ( NULL == (*arrstr)->strings ) {
        goto ret;
    }

    for (i=0; i < (*arrstr)->len; i++) {
        free( (*arrstr)->strings[i] );
    }

ret:
    free( *arrstr );
    *arrstr = NULL;
}

/***********************************************//**
 *
 ***************************************************
 */
ArrStr *arrstr_new( size_t nahead )
{
    ArrStr *arrstr = calloc( 1, sizeof(*arrstr) );
    if ( NULL == arrstr ) {
        DBG_ERRMSG( "arrstr = calloc() failed\n" );
        return NULL;
    }

    if ( nahead < 2 ) {
        nahead = 2;
    }

    arrstr->strings = calloc( nahead, sizeof( *(arrstr->strings) ) );
    if ( NULL == arrstr->strings ) {
        DBG_ERRMSG( "arrstr->strings = calloc() failed\n" );
        free( arrstr );
        return NULL;
    }

    arrstr->len = 0;
    arrstr->capacity = nahead;
    return arrstr;
}

/***********************************************//**
 *
 ***************************************************
 */
ArrStr *arrstr_new_from_file( size_t nahead, const char *fname )
{
    ArrStr *arrstr = NULL;
    char   **temp  = NULL;
    FILE *fp = fopen( fname, "r" );
    char input[BUFSIZ] = {'\0'};
    char errmsg[BUFSIZ] = {'\0'};
    size_t i = 0, cap = 0;          /* i for length, cap for capacity */

    if ( NULL == fp ) {
        DBG_ERRMSG( "fopen() failed\n" );
        goto ret_failure;
    }

    arrstr = calloc( 1, sizeof(*arrstr) );
    if ( NULL == arrstr ) {
        DBG_ERRMSG( "arrstr = calloc() failed\n" );
        goto ret_failure;
    }

    cap = ( nahead < 2 ) ? 2 : nahead;

    arrstr->strings = calloc( cap, sizeof( *(arrstr->strings) ) );
    if ( NULL == arrstr->strings ) {
        DBG_ERRMSG( "arrst->strings = calloc() failed\n" );
        goto ret_failure;
    }

    for (i=0; fgets(input, BUFSIZ, fp); i++) {
        arrstr->len = i;
        if ( i == cap ) {
            cap += cap;
            temp = realloc( arrstr->strings, cap * sizeof(*temp) );
            if ( NULL == temp ) {
                DBG_ERRMSG( "realloc() failed" );
                goto ret_failure;
            }
            arrstr->strings = temp;
        }
        arrstr->strings[i] = s_dup( input );
        if ( NULL == arrstr->strings[i] ) {
            snprintf(
                errmsg,
                BUFSIZ,
                "s_dup() failed on %lu'th string\n",
                (unsigned long) i
                );
            DBG_ERRMSG( errmsg );
            goto ret_failure;
        }
    }
    arrstr->len = i;
    arrstr->capacity = cap;

    fclose( fp );

    return arrstr;

ret_failure:
    fclose( fp );
    arrstr_free( &arrstr );
    return NULL;
}

/***********************************************//**
 *
 ***************************************************
 */
int arrstr_append_string( ArrStr *arrstr, const char *string )
{
    char **temp = NULL;
    size_t cap = 0, len = 0;

    if ( NULL == arrstr || NULL == string ) {
        DBG_ERRMSG( "NULL pointer argument" );
        return 0;
    }

    cap = arrstr->capacity;
    if ( arrstr->len == arrstr->capacity ) {
        cap += cap;
        temp = realloc( arrstr->strings, cap * sizeof(*temp) );
        if ( NULL == temp ) {
            DBG_ERRMSG( "realloc( arrstr->strings ) failed" );
            return 0;
        }
        arrstr->strings = temp;
        arrstr->capacity = cap;
    }

    len = arrstr->len;
    arrstr->strings[len] = s_dup( string );
    if ( NULL == arrstr->strings[len] ) {
        DBG_ERRMSG( "s_dup() failed" );
        return 0;
    }
    arrstr->len++;

    return 1;
}

/***********************************************//**
 *
 ***************************************************
 */
void arrstr_dump( const ArrStr *arrstr )
{
    size_t i;

    if ( NULL == arrstr ) {
        DBG_ERRMSG( "NULL pointer argument" );
        return;
    }

    printf( "[len: %zu]\n", arrstr->len );
    printf( "[cap: %zu]\n", arrstr->capacity );
    printf( "[strings @ 0x%X]\n", arrstr->strings );
    if ( NULL != arrstr->strings ) {
        for (i=0; i < arrstr->len; i++) {
            printf( "%s", arrstr->strings[i] );
        }
    }
    putchar( '\n' );
}

/***********************************************//**
 *
 ***************************************************
 */
int arrstr_sort( ArrStr *arrstr )
{
    if ( NULL == arrstr ) {
        DBG_ERRMSG( "NULL pointer argument" );
        return 0;
    }
    if ( arrstr->len < 2 ) {
        DBG_ERRMSG( "cannot sort less than 2 strings" );
        return 0;
    }

    qsort(
        arrstr->strings,
        arrstr->len,
        sizeof( *(arrstr->strings) ),
        cb_compare_strings
        );

    return 1;
}


/***********************************************//**
 *
 ***************************************************
 */
int main( void )
{
    const size_t NTIMES = 5;                /* desired count of duplicates */
    size_t i,j;
    ArrStr *dups = NULL;                     /* to keep desired duplicates */

    /*
     * Read file-lines into arr as cstrings.
     */

    ArrStr *arr = arrstr_new_from_file(
                32,
                "charles_file_to_array.txt"
                );
    if ( NULL == arr ) {
        DBG_ERRMSG( "arrstr_new_from_file() failed" );
        return EXIT_FAILURE;
    }

    printf( "[lines read: %zu]\n", arr->len );
//    arrstr_dump( arr );

//    printf( "[sorted]\n" );
    arrstr_sort( arr );
//    arrstr_dump( arr );

    /*
     * Store in dups all duplicates that occur exactly NTIMES times.
     */

    dups = arrstr_new( 32 );
    if ( NULL == dups ) {
        DBG_ERRMSG( "arrstr_new() fialed" );
        goto exit_failure;
    }

    for (i=0; i < arr->len-1; i++) {
        for (
        j=0;
        j < arr->len-1 && 0 == strcmp(arr->strings[i], arr->strings[i+j]);
        j++
        ){
            ;
        }
        if ( NTIMES == j ) {
            arrstr_append_string( dups, arr->strings[i] );
        }
        i += (j-1);
    }

    printf( "[lines found exactly %d times: %zu]\n", NTIMES, dups->len );
//    arrstr_dump( dups );

    /* cleanup & exit */
    arrstr_free( &arr );
    arrstr_free( &dups );
    return EXIT_SUCCESS;

exit_failure:
    arrstr_free( &arr );
    arrstr_free( &dups );
    return EXIT_FAILURE;

}

The ArrStr mini-interface is pretty generic, so it may be used elsewhere too (and be enhanecd of course). However, it was written in a hurry so it may contains bugs.

That being said, speed does not depend on the language only. It also depeneds on the algorthms and the data-strutures used.

In C you must code most of the data-structures you need manually, along with their interface. Unless of course you use one or more 3rd-part libraries, which is frankly what most people do anyway. Other languages provide such libraries as part of their eco-system.

Anyway, what you asked was pretty simple, that's why I posted the code. I hope it helps you, but note that speed-wise it is pretty naive.

This ABSOLUTELY helps me! Particularly your pointing to the importance of the algorithm and data-structure as components of speed. I appreciate the code and I have had time to step through it carefully. I see your point in that this was not trivial and a bit beyond novice, so I will have to think about this going forward. Cheers.

**Charles Driver** · 06-20-2015

Originally Posted by Nominal Animal

If the order of the results is not important,

Code:

mawk '{ count[$1 " " $2 " " $3 " " $4 " " $5]++ } END { for (key in count) if (count[key] == 5) print key; }' file.txt

is very, very fast. The above also works using other awk variants (gawk, any generic awk), but in my experience, mawk is the fastest one if multiple implementations are available.

If the results need to be in sorted order, then pipe the above through sort -n -k 1 -k 2 -k 3 -k 4 -k 5 or use GNU awk -compatible awk and

Code:

gawk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc";
              for (i=1; i<=35; i++) k[i] = sprintf("%3d", i); }
            { count[k[int($1)] k[int($2)] k[int($3)] k[int($4)] k[int($5)]]++ }
      END   { for (key in count) if (count[key] == 5) printf "%s\n", key }' file.txt

In general, this sounds like a combinatorial problem.

If speed was of the essence, I'd use POSIX low-level I/O (mmap(2) if the data is in a file, read(2) otherwise), and a custom character-by-character parser to parse each part. The input part, scanning and parsing the input, is in my experience the bottleneck here.

Since 36⁵ = 60,466,176, you only need 26 bits to describe each five-part number, exactly. In fact, you can pack five 6-bit digits (0-63, inclusive) in a 30-bit integer. Therefore, I'd use a 32-bit integer type (uint32_t from stdint.h, most likely) to represent each five-part number. These are fast and easy to manipulate and to e.g. sort.

The basic way to find the number of occurrences is to just sort the numbers. Any duplicates will be consecutive, and thus trivial to count. (A single linear pass over the array suffices.)

One interesting option would be to eschew the sorting part altogether, and simply use an array of 36⁵ (or 35⁵, if 0 is not a valid part) unsigned chars to count the number each possible combination is seen. (This assumes that only small counts are interesting; if a combination may occur more times than can be described in an unsigned char, overflow should be checked against -- for example, using count[number] += count[number] < UCHAR_MAX; to increment the count. With GCC-4.8 and later, if (__builtin_expect(!++(count[number]), 0)) --(count[number]); compiles to nice fast x86-64 code.)

That option has the benefit of exact memory use (60,466,176 or 52,521,875 bytes, respectively) and avoiding any need for sorting -- as this would essentially be a bucket sort itself. If the cache behaviour ends up being acceptable on the used architecture, this would likely be the fastest way to implement this.

I am really happy you took the time to explain how the shell scripting can be a fast solution to the goal. I would have never considered going this way and I am open to all types of solutions. It will be very interesting to benchmark it against C.

THANK YOU.

**Charles Driver** · 06-20-2015

Originally Posted by algorism

Here's a quick-and-dirty hash table implementation using nominal's 6-bit coding scheme and a custom scanner. (It should probably be coupled with mmap, but I've just used fgets, reading line by line.)

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <inttypes.h>

// set this to about 2000000 (~30MB)
#define TABLE_SIZE    100000
#define BASEMENT_SIZE (TABLE_SIZE/10)    // 10 percent (?)

typedef uint32_t uint;

typedef struct Record {
    uint n, cnt, next;
} Record;

Record table[TABLE_SIZE]; // Relying on zeroing of global memory.
uint g_nextElement = TABLE_SIZE - BASEMENT_SIZE;

int addToTable(uint n) {
    uint h = n % (TABLE_SIZE - BASEMENT_SIZE);

    if (table[h].cnt == 0) { // unused slot in main table
        table[h].n = n;
        table[h].cnt = 1;
        return 1;
    }

    while (1) {
        if (table[h].n == n) {  // already in table
            ++table[h].cnt;
            return 0;
        }
        if (table[h].next == 0)
            break;
        h = table[h].next;
    }

    if (g_nextElement >= TABLE_SIZE) {
        fprintf(stderr, "Error: hash table overflow\n");
        exit(EXIT_FAILURE);
    }
    h = table[h].next = g_nextElement++;
    table[h].n = n;
    table[h].cnt = 1;
    return 1;
}

void printWithCnt(uint n) {
    for (uint i=0; i<TABLE_SIZE; ++i)
        if (table[i].cnt == n) {
            uint n = table[i].n;
            printf("%u %u %u %u %u\n",
                (n >> 24) & 0x3f,
                (n >> 18) & 0x3f,
                (n >> 12) & 0x3f,
                (n >>  6) & 0x3f,
                 n        & 0x3f);
        }
}

uint scan_line(char *s) {
    uint ret = 0, n = 0;
    for ( ; *s; ++s) {
        if (*s>='0' && *s<='9')
            n = 10*n + (*s - '0');
        else {
            ret = ret << 6 | n;
            n = 0;
        }
    }
    return ret;
}

int main() {
    char line[BUFSIZ];
    uint szFile = 0, szTable = 0;

    FILE *fp = fopen("serials.txt", "r");
    if (!fp) {
        fprintf(stderr, "Error: can't open file\n");
        exit(EXIT_FAILURE);
    }

    while (fgets(line, sizeof line, fp)) {
        if (addToTable(scan_line(line)))
            ++szTable;
        ++szFile;
    }
    fclose(fp);

    printf("szTable: %u   szFile: %u\n", (unsigned)szTable, (unsigned)szFile);

    printWithCnt(5);

    return 0;
}

If you don't have uint32_t (from c99) you can just use unsigned int.
The hash table's pretty simplistic.
I didn't even use pointers but stuck with indices.
I'm not sure if that makes it slower, but at least uint32_t indices are only 32 bits even on a 64 bit machine, if that matters.

EDIT: Tried to make the code a little less wonky.

Thank you for this code! I was able to step through it and learn a bit. You and migf1, rocked the code because it helped me see what will be needed and your suggestions were valuable to me as a novice. Aprreciated.