If the order of the results is not important,
Code:
mawk '{ count[$1 " " $2 " " $3 " " $4 " " $5]++ } END { for (key in count) if (count[key] == 5) print key; }' file.txt
is very, very fast. The above also works using other awk variants (
gawk, any generic
awk), but in my experience,
mawk is the fastest one if multiple implementations are available.
If the results need to be in sorted order, then pipe the above through
sort -n -k 1 -k 2 -k 3 -k 4 -k 5 or use GNU awk -compatible awk and
Code:
gawk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc";
for (i=1; i<=35; i++) k[i] = sprintf("%3d", i); }
{ count[k[int($1)] k[int($2)] k[int($3)] k[int($4)] k[int($5)]]++ }
END { for (key in count) if (count[key] == 5) printf "%s\n", key }' file.txt
In general, this sounds like a combinatorial problem.
If speed was of the essence, I'd use POSIX low-level I/O (
mmap(2) if the data is in a file,
read(2) otherwise), and a custom character-by-character parser to parse each part. The input part, scanning and parsing the input, is in my experience the bottleneck here.
Since 36
5 = 60,466,176, you only need 26 bits to describe each five-part number, exactly. In fact, you can pack five 6-bit digits (0-63, inclusive) in a 30-bit integer. Therefore, I'd use a 32-bit integer type (
uint32_t from
stdint.h, most likely) to represent each five-part number. These are fast and easy to manipulate and to e.g. sort.
The basic way to find the number of occurrences is to just sort the numbers. Any duplicates will be consecutive, and thus trivial to count. (A single linear pass over the array suffices.)
One interesting option would be to eschew the sorting part altogether, and simply use an array of 36
5 (or 35
5, if 0 is not a valid part)
unsigned chars to count the number each possible combination is seen. (This assumes that only small counts are interesting; if a combination may occur more times than can be described in an unsigned char, overflow should be checked against -- for example, using
count[number] += count[number] < UCHAR_MAX; to increment the count. With GCC-4.8 and later,
if (__builtin_expect(!++(count[number]), 0)) --(count[number]); compiles to nice fast x86-64 code.)
That option has the benefit of exact memory use (60,466,176 or 52,521,875 bytes, respectively) and avoiding any need for sorting -- as this would essentially be a bucket sort itself. If the cache behaviour ends up being acceptable on the used architecture, this would likely be the fastest way to implement this.