Note: My code in post #27 has a bug in the output: it prints the output parts in reverse order. Lines 272-276 should be
Code:
int d;
*(--p) = '\0';
*(--p) = '\n';
for (d = 0; d < PARTS; d++) {
Well, scalability is an issue. The example data set is just 21019 records, whereas the practical data sets are approximately two million records, almost a hundred times larger. Testing with the example data set is not a reliable indicator of how long the programs take with the larger data sets.
So, I generated a uniformly random two-million-row data set using awk,
Code:
awk 'BEGIN { n = 2000000; while (n-->0) printf "%d %d %d %d %d\n", int(1 + 35*rand()), int(1 + 35*rand()), int(1 + 35*rand()), int(1 + 35*rand()), int(1 + 35*rand()) }' > two-million.txt
and generated a packed version (using the format I described in post #27: five characters per record (row in the original file), using the characters
123456789:;<=>?@ABCDEFGHIJKLMNOPQRS to encode the 35 possible values of each field, with no record separators) using
Code:
awk 'NF>=5 { printf "%c%c%c%c%c", 48+$1, 48+$2, 48+$3, 48+$4, 48+$5 }' two-million.txt > two-million.packed
The text format file was 27,428,568 bytes long. The packed format file was 10,000,000 bytes long, only about 36% of the size of the text format file.
If you used e.g. printf("%d %d %d %d %d\n", $n1, $n2, $n3, $n4, $n5) in PHP to output the records, just use printf("%c%c%c%c%c", 48+$n1, 48+$n2, 48+$n3, 48+$n4, 48+$n5) instead to output them in the packed format.
On my Core i5-4200U laptop, with the input files in cache,
- the mawk script took 24 seconds
- the gawk script took 10 seconds (surprise!!)
- plain sort -g two-million.txt takes 9 seconds (but uses all four threads (two cores) of this CPU; about 31 seconds user CPU time)
- my mmap code took 0.26 seconds
- my packed-input format code, below, took 0.12 seconds
It seems that gawk (GNU awk) has a much better hashing implementation than mawk, given how much faster it is on this large data set. I wasn't aware of this before, as I normally use awk/mawk/gawk only for conversions and test-data generation, nothing time-sensitive.
Using sort is not worth it; doing it all in gawk (which uses only one CPU core, under 10 seconds of user CPU time) is the better option.
My memory-mapping code is faster than the gawk script by a factor of about 38. The packed-input format cuts the processing time to less than half of that, a factor of about 83 over the gawk script.
In Linux, you can drop all caches using sudo sh -c 'sync; echo 3 >/proc/sys/vm/drop_caches; sync'. Unfortunately (with respect to this test!), this laptop has a sizzling-fast SSD, and the results are basically the same even when the files are not cached, because reading them simply does not take long enough to show up:
- the mawk script took 24 seconds
- the gawk script took 10 seconds
- my mmap code took 0.26 seconds
- my packed-input format code, below, took 0.12 seconds
Note that when using an SSD, the initial chunk of the file is available almost immediately, whereas spinning disks add all kinds of latencies, on the order of 10 ms.
3000 files with two million records each would take about fourteen minutes to process if in readable text format, or about six minutes if in the packed format.
The version of my mmap program that uses this packed input format, but produces exactly the same (human-readable decimal number) output as my mmap program, is here:
Code:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
#define MIN_PART 1
#define MIN_CHAR 49
#define MAX_CHAR 85
#define RADIX (MAX_CHAR - MIN_CHAR + 1)
#define LIMIT (RADIX*RADIX*RADIX*RADIX*RADIX)
#define PARTS 5
#define PART_DIGITS 2
#if defined(MAP_NORESERVE)
#define ROMAP_FLAGS MAP_SHARED | MAP_NORESERVE
#else
#define ROMAP_FLAGS MAP_SHARED
#endif
#if __GNUC__ >= 4
#define unlikely(x) __builtin_expect((x), 0)
#else
#define unlikely(x) (x)
#endif
typedef struct {
const unsigned char *data;
size_t size;
size_t len;
} romap;
static void map_release(romap *const map)
{
if (map) {
if (map->size)
munmap((void *)map->data, map->size);
map->data = NULL;
map->size = 0;
map->len = 0;
}
}
static int map_readonly(const char *const filename, romap *const map)
{
struct stat info;
void *data;
long page;
size_t size, len;
int fd, result;
if (!map)
return errno = EINVAL;
map->data = NULL;
map->len = 0;
map->size = 0;
if (!filename || !*filename)
return errno = ENOENT;
errno = 0;
page = sysconf(_SC_PAGESIZE);
if (page < 1L || errno != 0)
return errno = ENOSYS;
do {
fd = open(filename, O_RDONLY | O_NOCTTY);
} while (fd == -1 && errno == EINTR);
if (fd == -1)
return errno;
do {
result = fstat(fd, &info);
} while (result == -1 && errno == EINTR);
if (result == -1) {
const int saved_errno = errno;
close(fd);
return errno = saved_errno;
}
if (info.st_size < (off_t)1) {
close(fd);
return errno = ENODATA;
}
len = (size_t)info.st_size;
if ((off_t)len != info.st_size) {
close(fd);
return errno = ENOMEM;
}
if (len % (size_t)page)
size = len + (size_t)page - (len % (size_t)page);
else
size = len;
if (size < len) {
close(fd);
return errno = ENOMEM;
}
do {
data = mmap(NULL, size, PROT_READ, ROMAP_FLAGS, fd, (off_t)0);
} while (data == MAP_FAILED && errno == EINTR);
if (data == MAP_FAILED) {
const int saved_errno = errno;
close(fd);
return errno = saved_errno;
}
if (close(fd)) {
munmap(data, size);
return errno = EIO;
}
map->data = (const unsigned char *)data;
map->size = size;
map->len = len;
return errno = 0;
}
static int parse_ulong(const char *s, unsigned long *const val)
{
const char *end = NULL;
if (!val || !s || !*s)
return EINVAL;
errno = 0;
*val = strtoul(s, (char **)&end, 0);
if (errno || !end || end <= s)
return EINVAL;
while (*end == '\t' || *end == '\n' || *end == '\v' ||
*end == '\f' || *end == '\r' || *end == ' ')
end++;
if (*end)
return EINVAL;
return 0;
}
int main(int argc, char *argv[])
{
unsigned char *count;
romap input;
unsigned long exact;
unsigned char match;
size_t i, n;
if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s FILENAME COUNT\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "This program reads FILENAME, which consists of\n");
fprintf(stderr, "five-character records with no record separators.\n");
fprintf(stderr, "The characters are ASCII codes 49 '1' to 83 'S':\n");
fprintf(stderr, "  1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B\n");
fprintf(stderr, "  C D E F G H I J K L M N O P Q R S\n");
fprintf(stderr, "Character '1' corresponds to value 1, and\n");
fprintf(stderr, "character 'S' to value 35.\n");
fprintf(stderr, "\n");
fprintf(stderr, "Records that occur exactly COUNT times are output as\n");
fprintf(stderr, " V1 V2 V3 V4 V5\n");
fprintf(stderr, "where V1..V5 are the decimal values corresponding to\n");
fprintf(stderr, "the five-character string.\n");
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
if (parse_ulong(argv[2], &exact)) {
fprintf(stderr, "%s: Invalid COUNT.\n", argv[2]);
return EXIT_FAILURE;
}
match = (unsigned char)exact;
if ((unsigned long)match != exact || !(unsigned char)(match + 1U)) {
fprintf(stderr, "%s: COUNT is too large.\n", argv[2]);
return EXIT_FAILURE;
}
if (map_readonly(argv[1], &input)) {
fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
if (input.len % (size_t)PARTS) {
fprintf(stderr, "%s: File length is not a multiple of %d!\n", argv[1], PARTS);
return EXIT_FAILURE;
}
count = malloc((size_t)LIMIT * sizeof count[0]);
if (!count) {
fprintf(stderr, "Not enough memory for the buckets.\n");
return EXIT_FAILURE;
}
memset(count, 0, (size_t)LIMIT * sizeof count[0]);
fprintf(stderr, "Reading %s .. ", argv[1]);
fflush(stderr);
n = input.len / PARTS;
for (i = 0; i < n; i++) {
const unsigned char *p = input.data + i * (size_t)PARTS;
const unsigned char *const q = input.data + i * (size_t)PARTS + (size_t)PARTS;
size_t k = 0;
while (p < q)
if (*p >= MIN_CHAR && *p <= MAX_CHAR)
k = (size_t)RADIX * k + (size_t)(*(p++) - MIN_CHAR);
else {
fprintf(stderr, "%s: Invalid data in record %lu of %lu.\n", argv[1], 1UL + (unsigned long)i, (unsigned long)n);
return EXIT_FAILURE;
}
if (unlikely(!++count[k]))
--count[k];
}
fprintf(stderr, "%lu records.\n", (unsigned long)n);
fflush(stderr);
map_release(&input);
for (i = 0; i < (size_t)LIMIT; i++)
if (count[i] == match) {
char buffer[8 + (PART_DIGITS + 1) * PARTS];
char *p = buffer + sizeof buffer;
unsigned long u = (unsigned int)i;
int d;
*(--p) = '\0';
*(--p) = '\n';
for (d = 0; d < PARTS; d++) {
unsigned long v = (unsigned long)MIN_PART + (u % (unsigned long)RADIX);
u /= (unsigned long)RADIX;
do {
*(--p) = '0' + (v % 10UL);
v /= 10UL;
} while (v);
*(--p) = ' ';
}
fputs(p + 1, stdout);
}
return EXIT_SUCCESS;
}
Next, a threaded version of the above. Input processing is done by one or more separate threads, so this one will be slower (due to unnecessary overheads) if only one thread is used. Theoretically, it could be faster when multiple threads are used:
Code:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
#define MIN_PART 1
#define MIN_CHAR 49
#define MAX_CHAR 85
#define RADIX (MAX_CHAR - MIN_CHAR + 1)
#define LIMIT (RADIX*RADIX*RADIX*RADIX*RADIX)
#define PARTS 5
#define PART_DIGITS 2
#define MAX_THREADS 16
#if defined(MAP_NORESERVE)
#define ROMAP_FLAGS MAP_SHARED | MAP_NORESERVE
#else
#define ROMAP_FLAGS MAP_SHARED
#endif
#if __GNUC__ >= 4
#define unlikely(x) __builtin_expect((x), 0)
#else
#define unlikely(x) (x)
#endif
typedef struct {
const unsigned char *data;
size_t size;
size_t len;
} romap;
static void map_release(romap *const map)
{
if (map) {
if (map->size)
munmap((void *)map->data, map->size);
map->data = NULL;
map->size = 0;
map->len = 0;
}
}
static int map_readonly(const char *const filename, romap *const map)
{
struct stat info;
void *data;
long page;
size_t size, len;
int fd, result;
if (!map)
return errno = EINVAL;
map->data = NULL;
map->len = 0;
map->size = 0;
if (!filename || !*filename)
return errno = ENOENT;
errno = 0;
page = sysconf(_SC_PAGESIZE);
if (page < 1L || errno != 0)
return errno = ENOSYS;
do {
fd = open(filename, O_RDONLY | O_NOCTTY);
} while (fd == -1 && errno == EINTR);
if (fd == -1)
return errno;
do {
result = fstat(fd, &info);
} while (result == -1 && errno == EINTR);
if (result == -1) {
const int saved_errno = errno;
close(fd);
return errno = saved_errno;
}
if (info.st_size < (off_t)1) {
close(fd);
return errno = ENODATA;
}
len = (size_t)info.st_size;
if ((off_t)len != info.st_size) {
close(fd);
return errno = ENOMEM;
}
if (len % (size_t)page)
size = len + (size_t)page - (len % (size_t)page);
else
size = len;
if (size < len) {
close(fd);
return errno = ENOMEM;
}
do {
data = mmap(NULL, size, PROT_READ, ROMAP_FLAGS, fd, (off_t)0);
} while (data == MAP_FAILED && errno == EINTR);
if (data == MAP_FAILED) {
const int saved_errno = errno;
close(fd);
return errno = saved_errno;
}
if (close(fd)) {
munmap(data, size);
return errno = EIO;
}
map->data = (const unsigned char *)data;
map->size = size;
map->len = len;
return errno = 0;
}
static int parse_ulong(const char *s, unsigned long *const val)
{
const char *end = NULL;
if (!val || !s || !*s)
return EINVAL;
errno = 0;
*val = strtoul(s, (char **)&end, 0);
if (errno || !end || end <= s)
return EINVAL;
while (*end == '\t' || *end == '\n' || *end == '\v' ||
*end == '\f' || *end == '\r' || *end == ' ')
end++;
if (*end)
return EINVAL;
return 0;
}
struct work {
const char *fname;
unsigned char *count;
const unsigned char *data;
size_t first;
size_t limit;
};
pthread_mutex_t error_mutex = PTHREAD_MUTEX_INITIALIZER;
void *worker(void *workref)
{
const char *const fname = ((struct work *)workref)->fname;
unsigned char *const count = ((struct work *)workref)->count;
const unsigned char *const data = ((struct work *)workref)->data;
const size_t limit = ((struct work *)workref)->limit;
size_t index = ((struct work *)workref)->first;
while (index < limit) {
const unsigned char *p = data + index * PARTS;
const unsigned char *const q = data + (index + 1) * PARTS;
size_t value = 0;
while (p < q)
if (*p >= MIN_CHAR && *p <= MAX_CHAR)
value = (size_t)RADIX * value + (size_t)(*(p++) - MIN_CHAR);
else {
pthread_mutex_lock(&error_mutex);
fprintf(stderr, "%s: Record %zu at file offset %zu is invalid.\n",
fname, index, (size_t)PARTS * index);
fflush(stderr);
pthread_mutex_unlock(&error_mutex);
return (void *)(long)EINVAL;
}
index++;
if (unlikely(!__atomic_add_fetch(count + value, (unsigned char)1U, __ATOMIC_SEQ_CST)))
__atomic_store_n(count + value, ~(unsigned char)0U, __ATOMIC_SEQ_CST);
}
return NULL;
}
int main(int argc, char *argv[])
{
pthread_t thread_id[MAX_THREADS];
struct work thread_work[MAX_THREADS];
size_t thread_count;
pthread_attr_t attrs;
unsigned char *count;
romap input;
unsigned long value;
unsigned char match;
size_t i, n;
if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s FILENAME COUNT [ THREADS ]\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "This program reads FILENAME, which consists of\n");
fprintf(stderr, "five-character records with no record separators.\n");
fprintf(stderr, "The characters are ASCII codes 49 '1' to 83 'S':\n");
fprintf(stderr, "  1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B\n");
fprintf(stderr, "  C D E F G H I J K L M N O P Q R S\n");
fprintf(stderr, "Character '1' corresponds to value 1, and\n");
fprintf(stderr, "character 'S' to value 35.\n");
fprintf(stderr, "\n");
fprintf(stderr, "Records that occur exactly COUNT times are output as\n");
fprintf(stderr, " V1 V2 V3 V4 V5\n");
fprintf(stderr, "where V1..V5 are the decimal values corresponding to\n");
fprintf(stderr, "the five-character string.\n");
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
if (parse_ulong(argv[2], &value)) {
fprintf(stderr, "%s: Invalid COUNT.\n", argv[2]);
return EXIT_FAILURE;
}
match = (unsigned char)value;
if ((unsigned long)match != value || !(unsigned char)(match + 1U)) {
fprintf(stderr, "%s: COUNT is too large.\n", argv[2]);
return EXIT_FAILURE;
}
if (argc > 3) {
if (parse_ulong(argv[3], &value) || value < 1UL || value > (unsigned long)MAX_THREADS) {
fprintf(stderr, "%s: Invalid number of threads.\n", argv[3]);
return EXIT_FAILURE;
}
thread_count = (size_t)value;
} else
thread_count = 1;
if (map_readonly(argv[1], &input)) {
fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
if (input.len % (size_t)PARTS) {
fprintf(stderr, "%s: File length is not a multiple of %d!\n", argv[1], PARTS);
return EXIT_FAILURE;
}
n = input.len / (size_t)PARTS;
count = malloc((size_t)LIMIT * sizeof count[0]);
if (!count) {
fprintf(stderr, "Not enough memory for the buckets.\n");
return EXIT_FAILURE;
}
memset(count, 0, (size_t)LIMIT * sizeof count[0]);
for (i = 0; i < thread_count; i++) {
thread_work[i].fname = argv[1];
thread_work[i].data = input.data;
thread_work[i].count = count;
thread_work[i].first = (n * i) / thread_count;
thread_work[i].limit = (n * (i + 1)) / thread_count;
}
pthread_attr_init(&attrs);
if (pthread_attr_setstacksize(&attrs, 32768U))
pthread_attr_setstacksize(&attrs, 65536U);
fprintf(stderr, "Reading %s, %lu records .. ", argv[1], (unsigned long)n);
fflush(stderr);
for (i = 0; i < thread_count; i++) {
int result;
result = pthread_create(&(thread_id[i]), &attrs, worker, &thread_work[i]);
if (result) {
fprintf(stderr, "Cannot create worker thread: %s.\n", strerror(result));
while (i-->0)
pthread_cancel(thread_id[i]);
return EXIT_FAILURE;
}
}
pthread_attr_destroy(&attrs);
{ size_t errorcount = 0;
for (i = 0; i < thread_count; i++) {
void *retval = NULL;
pthread_join(thread_id[i], &retval);
if (retval)
++errorcount;
}
if (errorcount)
return EXIT_FAILURE;
}
fprintf(stderr, "Done.\n");
fflush(stderr);
map_release(&input);
for (i = 0; i < (size_t)LIMIT; i++)
if (count[i] == match) {
char buffer[8 + (PART_DIGITS + 1) * PARTS];
char *p = buffer + sizeof buffer;
unsigned long u = (unsigned int)i;
int d;
*(--p) = '\0';
*(--p) = '\n';
for (d = 0; d < PARTS; d++) {
unsigned long v = (unsigned long)MIN_PART + (u % (unsigned long)RADIX);
u /= (unsigned long)RADIX;
do {
*(--p) = '0' + (v % 10UL);
v /= 10UL;
} while (v);
*(--p) = ' ';
}
fputs(p + 1, stdout);
}
return EXIT_SUCCESS;
}
However, as I suspected, we are RAM or cache bandwidth bound, and adding CPU cores does not help -- on my machine, at least. Using only one worker thread, the two million records are processed in 0.15 seconds; a bit more than the single-threaded process takes, due to the overheads.
With two or more threads, the real-world time taken is 0.11 seconds, essentially the same as the single-threaded process, so the extra cores do not help. However, that might be because the i5-4200U has only two CPU cores (and four threads). In any case, this indicates that multithreading might not help here; it would be useful to verify on other hardware, too.
Finally, here is my initial suggestion (which treats each packed record as a large number; a kind of trivial hashing). This one uses qsort() from the standard C library, and due to the large number of comparison-function calls that entails, it is not the fastest possible approach. On my machine it takes 0.38 seconds (even though it uses the packed input format), about three times slower than the single-threaded packed-input program above:
Code:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
#define MIN_PART 1
#define MIN_CHAR 49
#define MAX_CHAR 85
#define RADIX (MAX_CHAR - MIN_CHAR + 1)
#define LIMIT (RADIX*RADIX*RADIX*RADIX*RADIX)
#define PARTS 5
#define PART_DIGITS 2
#if defined(MAP_NORESERVE)
#define ROMAP_FLAGS MAP_SHARED | MAP_NORESERVE
#else
#define ROMAP_FLAGS MAP_SHARED
#endif
#if __GNUC__ >= 4
#define unlikely(x) __builtin_expect((x), 0)
#else
#define unlikely(x) (x)
#endif
typedef struct {
const unsigned char *data;
size_t size;
size_t len;
} romap;
static void map_release(romap *const map)
{
if (map) {
if (map->size)
munmap((void *)map->data, map->size);
map->data = NULL;
map->size = 0;
map->len = 0;
}
}
static int map_readonly(const char *const filename, romap *const map)
{
struct stat info;
void *data;
long page;
size_t size, len;
int fd, result;
if (!map)
return errno = EINVAL;
map->data = NULL;
map->len = 0;
map->size = 0;
if (!filename || !*filename)
return errno = ENOENT;
errno = 0;
page = sysconf(_SC_PAGESIZE);
if (page < 1L || errno != 0)
return errno = ENOSYS;
do {
fd = open(filename, O_RDONLY | O_NOCTTY);
} while (fd == -1 && errno == EINTR);
if (fd == -1)
return errno;
do {
result = fstat(fd, &info);
} while (result == -1 && errno == EINTR);
if (result == -1) {
const int saved_errno = errno;
close(fd);
return errno = saved_errno;
}
if (info.st_size < (off_t)1) {
close(fd);
return errno = ENODATA;
}
len = (size_t)info.st_size;
if ((off_t)len != info.st_size) {
close(fd);
return errno = ENOMEM;
}
if (len % (size_t)page)
size = len + (size_t)page - (len % (size_t)page);
else
size = len;
if (size < len) {
close(fd);
return errno = ENOMEM;
}
do {
data = mmap(NULL, size, PROT_READ, ROMAP_FLAGS, fd, (off_t)0);
} while (data == MAP_FAILED && errno == EINTR);
if (data == MAP_FAILED) {
const int saved_errno = errno;
close(fd);
return errno = saved_errno;
}
if (close(fd)) {
munmap(data, size);
return errno = EIO;
}
map->data = (const unsigned char *)data;
map->size = size;
map->len = len;
return errno = 0;
}
static int parse_ulong(const char *s, unsigned long *const val)
{
const char *end = NULL;
if (!val || !s || !*s)
return EINVAL;
errno = 0;
*val = strtoul(s, (char **)&end, 0);
if (errno || !end || end <= s)
return EINVAL;
while (*end == '\t' || *end == '\n' || *end == '\v' ||
*end == '\f' || *end == '\r' || *end == ' ')
end++;
if (*end)
return EINVAL;
return 0;
}
static int compare_unsigned_int(const void *ptr1, const void *ptr2)
{
if (*(const unsigned int *)ptr1 < *(const unsigned int *)ptr2)
return -1;
else
if (*(const unsigned int *)ptr1 > *(const unsigned int *)ptr2)
return +1;
else
return 0;
}
int main(int argc, char *argv[])
{
unsigned int *array;
romap input;
unsigned long value;
size_t match;
size_t i, n;
if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s FILENAME COUNT\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "This program reads FILENAME, which consists of\n");
fprintf(stderr, "five-character records with no record separators.\n");
fprintf(stderr, "The characters are ASCII codes 49 '1' to 83 'S':\n");
fprintf(stderr, "  1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B\n");
fprintf(stderr, "  C D E F G H I J K L M N O P Q R S\n");
fprintf(stderr, "Character '1' corresponds to value 1, and\n");
fprintf(stderr, "character 'S' to value 35.\n");
fprintf(stderr, "\n");
fprintf(stderr, "Records that occur exactly COUNT times are output as\n");
fprintf(stderr, " V1 V2 V3 V4 V5\n");
fprintf(stderr, "where V1..V5 are the decimal values corresponding to\n");
fprintf(stderr, "the five-character string.\n");
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
if (parse_ulong(argv[2], &value)) {
fprintf(stderr, "%s: Invalid COUNT.\n", argv[2]);
return EXIT_FAILURE;
}
match = (size_t)value;
if ((unsigned long)match != value) {
fprintf(stderr, "%s: COUNT is too large.\n", argv[2]);
return EXIT_FAILURE;
}
if (map_readonly(argv[1], &input)) {
fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
if (input.len % (size_t)PARTS) {
fprintf(stderr, "%s: File length is not a multiple of %d!\n", argv[1], PARTS);
return EXIT_FAILURE;
}
n = input.len / PARTS;
array = malloc((size_t)n * sizeof array[0]);
if (!array) {
fprintf(stderr, "Not enough memory for the array.\n");
return EXIT_FAILURE;
}
fprintf(stderr, "Reading %s, %lu records .. ", argv[1], (unsigned long)n);
fflush(stderr);
for (i = 0; i < n; i++) {
const unsigned char *p = input.data + i * (size_t)PARTS;
const unsigned char *const q = input.data + i * (size_t)PARTS + (size_t)PARTS;
size_t k = 0;
while (p < q)
if (*p >= MIN_CHAR && *p <= MAX_CHAR)
k = (size_t)RADIX * k + (size_t)(*(p++) - MIN_CHAR);
else {
fprintf(stderr, "%s: Invalid data in record %lu of %lu.\n", argv[1], 1UL + (unsigned long)i, (unsigned long)n);
return EXIT_FAILURE;
}
array[i] = k;
}
map_release(&input);
fprintf(stderr, "done. Sorting ..");
fflush(stderr);
qsort(array, n, sizeof array[0], compare_unsigned_int);
fprintf(stderr, "done.\n");
fflush(stderr);
i = 0; while (i < n) {
size_t j = i + 1;
while (j < n && array[i] == array[j])
j++;
if (j - i == match) {
char buffer[8 + (PART_DIGITS + 1) * PARTS];
char *p = buffer + sizeof buffer;
unsigned long u = (unsigned long)array[i];
int d;
*(--p) = '\0';
*(--p) = '\n';
for (d = 0; d < PARTS; d++) {
unsigned long v = (unsigned long)MIN_PART + (u % (unsigned long)RADIX);
u /= (unsigned long)RADIX;
do {
*(--p) = '0' + (v % 10UL);
v /= 10UL;
} while (v);
*(--p) = ' ';
}
fputs(p + 1, stdout);
}
i = j;
}
return EXIT_SUCCESS;
}
Replacing the qsort() call with an optimized quicksort that switches to insertion sort for short segments might improve on that. Such a quicksort can often be parallelized across multiple threads, so if the 0.12-second real-world time is still too large, a multithreaded optimized quicksort+insertion sort for unsigned integers with no extra payload would be my next thing to check.
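For reference, such a direct sort might look like the following sketch: a Hoare-partition quicksort on plain unsigned ints with a median-of-three pivot, falling back to insertion sort below a cutoff, recursing only into the smaller partition so the stack depth stays logarithmic. The cutoff of 16 is a guess, not a tuned constant:

```c
#include <stddef.h>

/* Direct quicksort + insertion sort for plain unsigned ints with no
 * payload: no comparator function calls, unlike qsort(). */
#define SORT_CUTOFF 16

static void insertion_sort_uint(unsigned int *a, size_t n)
{
    size_t i, j;
    for (i = 1; i < n; i++) {
        const unsigned int v = a[i];
        for (j = i; j > 0 && a[j - 1] > v; j--)
            a[j] = a[j - 1];
        a[j] = v;
    }
}

void quicksort_uint(unsigned int *a, size_t n)
{
    while (n > SORT_CUTOFF) {
        size_t mid = n / 2, i = 0, j = n - 1;
        unsigned int pivot, t;
        /* Median-of-three: order a[0], a[mid], a[n-1]. This also
         * guards the partition scans against running off the ends. */
        if (a[mid] < a[0])   { t = a[0];   a[0] = a[mid];   a[mid] = t; }
        if (a[n-1] < a[0])   { t = a[0];   a[0] = a[n-1];   a[n-1] = t; }
        if (a[n-1] < a[mid]) { t = a[mid]; a[mid] = a[n-1]; a[n-1] = t; }
        pivot = a[mid];
        /* Hoare partition around the pivot value. */
        for (;;) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i >= j)
                break;
            t = a[i]; a[i] = a[j]; a[j] = t;
            i++; j--;
        }
        /* Recurse into the smaller part, iterate on the larger. */
        if (j + 1 <= n - (j + 1)) {
            quicksort_uint(a, j + 1);
            a += j + 1;
            n -= j + 1;
        } else {
            quicksort_uint(a + j + 1, n - (j + 1));
            n = j + 1;
        }
    }
    insertion_sort_uint(a, n);
}
```

A parallel version would hand one of the two partitions to another thread while the current thread sorts the other, down to some minimum segment size.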
The reason I posted this message, and the above sources, is to show the odd, low-level directions you might end up taking when chasing performance in C. The details in the source code do not matter that much; it is the overall approach -- the algorithm and the implementation design -- that determines how the code performs. And because C is such a low-level language, the performance is often tightly coupled to the architecture used.
And we haven't talked about distributed processing for this at all, yet. We could have several processes, possibly running on separate machines, preprocessing or partially processing these files. Each one could process its own chunk, generating a packed array of records with the number of occurrences per record, in sorted order; other processes would take these chunks and combine them into one, also selecting only those entries that have the desired number of occurrences. While this sounds silly if you only consider a single file, it becomes much more feasible if you have a large set of files to process; then, distributing the processing in this fashion, especially if the input files are spread over the nodes of a computing cluster, might yield the results in the shortest possible elapsed real time.
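The combining step could be an ordinary sorted-array merge that sums the per-record counts. A sketch (the entry type and the names here are mine, just to illustrate the idea, not part of the programs above):

```c
#include <stddef.h>

/* Each node would produce a sorted array of (packed record value,
 * occurrence count) pairs for its chunk; a combiner merges two such
 * arrays, summing the counts of records that appear in both. */
typedef struct {
    unsigned int value;  /* packed record, as in the programs above */
    unsigned long count; /* occurrences within the chunk(s) */
} entry;

/* Merge sorted arrays a[0..na-1] and b[0..nb-1] into out, summing
 * counts of equal values; out must have room for na + nb entries.
 * Returns the number of entries written. */
size_t merge_counts(const entry *a, size_t na,
                    const entry *b, size_t nb,
                    entry *out)
{
    size_t i = 0, j = 0, k = 0;

    while (i < na && j < nb) {
        if (a[i].value < b[j].value)
            out[k++] = a[i++];
        else if (b[j].value < a[i].value)
            out[k++] = b[j++];
        else {
            out[k].value = a[i].value;
            out[k].count = a[i].count + b[j].count;
            k++; i++; j++;
        }
    }
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
    return k;
}
```

The final combiner would then simply scan the merged array and emit the entries whose count equals the desired COUNT.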
If you have any questions on the logic or implementation of my code -- or if you notice any bugs in it -- do ask: I'd be happy to try to answer and/or fix them.