Originally Posted by
hell-student
Check always the flag before a new loop cycle beginns or at the end of the section. I can't tell you now what would be better.
It should be possible to minimize the flag examination overhead, by using a separate flag variable that only transitions between zero and non-zero. That way atomicity is not an issue, and it is enough to have the flag be marked volatile. The actual cost, then, is similar to incrementing a local variable. (Minimal.)
Before we get to the example code:
I realized that I didn't mention that the actual creation of a thread is not at all expensive. Consider the following program:
Code:
#include <pthread.h>
#include <time.h>
#include <string.h>
#include <stdio.h>
#define THREAD_SETS 100000
#define THREADS_PER_SET 1
struct counter {
pthread_mutex_t lock;
unsigned long value;
};
void *worker(void *const payload)
{
struct counter *const c = (struct counter *)payload;
pthread_mutex_lock(&c->lock);
c->value++;
pthread_mutex_unlock(&c->lock);
return NULL;
}
int main(void)
{
struct counter total = { PTHREAD_MUTEX_INITIALIZER, 0UL };
struct timespec started, stopped;
double elapsed;
pthread_attr_t attrs;
pthread_t id[THREADS_PER_SET];
int i, k, result;
/* 64k stack per thread should suffice for real-world threads,
* assuming they allocate larger arrays dynamically,
* instead of relying on stack. */
pthread_attr_init(&attrs);
pthread_attr_setstacksize(&attrs, 65536);
/* Start timing. */
clock_gettime(CLOCK_REALTIME, &started);
for (k = 0; k < THREAD_SETS; k++) {
/* Create the threads. */
for (i = 0; i < THREADS_PER_SET; i++) {
result = pthread_create(&id[i], &attrs, worker, &total);
if (result) {
fprintf(stderr, "Error creating a new thread: %s.\n", strerror(result));
return 1;
}
}
/* Join the threads. */
for (i = 0; i < THREADS_PER_SET; i++) {
result = pthread_join(id[i], NULL);
if (result) {
fprintf(stderr, "Error joining a thread: %s.\n", strerror(result));
return 1;
}
}
}
pthread_attr_destroy(&attrs);
/* Stop timing. */
clock_gettime(CLOCK_REALTIME, &stopped);
elapsed = (double)(stopped.tv_sec - started.tv_sec)
+ (double)(stopped.tv_nsec - started.tv_nsec) / 1000000000.0;
printf("%lu threads (%d sets of %d threads) created in %.6f seconds real time,\n",
total.value, THREAD_SETS, THREADS_PER_SET, elapsed);
if (elapsed > 0.0)
printf("%.0f threads per second, %.9f seconds per thread on average.\n",
(double)total.value / elapsed, elapsed / (double)total.value);
return 0;
}
If you save the above as spawn.c, you can compile and run it using e.g.
Code:
gcc -W -Wall -m64 -O3 spawn.c -lpthread -lrt -o spawn
./spawn
My desktop machine (AMD Athlon X4 640) can create and join about 65000 single threads per second, or about 110000 in groups of ten. This means that the time taken to create a new thread (and join an old one, assuming the old thread has already completed), takes less than 0.000009 seconds real time on it.
Note that it is quite important to specify the per-thread stack size (see pthread_attr_t attrs in above code). While the stack given to the original thread in the process will grow dynamically, per-thread stacks are fixed in size. Because of that, the default tends to be ridiculously large, something like 8 megabytes per thread. Unless you use large local arrays, a small fraction of that is sufficient. The code above uses 64k. (The stack size is a power of two, minimum 16384 on Linux pthreads. The lower limit is not likely to be any larger on any future architectures.)
Even if you modify the program to cancel the threads asynchronously,
Code:
#include <pthread.h>
#include <time.h>
#include <string.h>
#include <stdio.h>
#define THREAD_SETS 100000
#define THREADS_PER_SET 1
struct counter {
pthread_mutex_t lock;
unsigned long value;
};
pthread_mutex_t locked = PTHREAD_MUTEX_INITIALIZER;
unsigned long failed = 0UL;
void *worker(void *const payload)
{
struct counter *const c = (struct counter *)payload;
pthread_mutex_lock(&c->lock);
c->value++;
pthread_mutex_unlock(&c->lock);
pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
pthread_mutex_lock(&locked);
failed++;
return NULL;
}
int main(void)
{
struct counter total = { PTHREAD_MUTEX_INITIALIZER, 0UL };
struct timespec started, stopped;
double elapsed;
pthread_attr_t attrs;
pthread_t id[THREADS_PER_SET];
int i, k, result;
pthread_mutex_lock(&locked);
/* 64k stack per thread should suffice for real-world threads,
* assuming they allocate larger arrays dynamically,
* instead of relying on stack. */
pthread_attr_init(&attrs);
pthread_attr_setstacksize(&attrs, 65536);
/* Start timing. */
clock_gettime(CLOCK_REALTIME, &started);
for (k = 0; k < THREAD_SETS; k++) {
/* Create the threads. */
for (i = 0; i < THREADS_PER_SET; i++) {
result = pthread_create(&id[i], &attrs, worker, &total);
if (result) {
fprintf(stderr, "Error creating a new thread: %s.\n", strerror(result));
return 1;
}
}
/* Cancel the threads. */
for (i = 0; i < THREADS_PER_SET; i++)
pthread_cancel(id[i]);
/* Join the threads. */
for (i = 0; i < THREADS_PER_SET; i++) {
result = pthread_join(id[i], NULL);
if (result) {
fprintf(stderr, "Error joining a thread: %s.\n", strerror(result));
return 1;
}
}
}
pthread_attr_destroy(&attrs);
/* Stop timing. */
clock_gettime(CLOCK_REALTIME, &stopped);
elapsed = (double)(stopped.tv_sec - started.tv_sec)
+ (double)(stopped.tv_nsec - started.tv_nsec) / 1000000000.0;
printf("%lu threads (%d sets of %d threads) created in %.6f seconds real time,\n",
total.value, THREAD_SETS, THREADS_PER_SET, elapsed);
if (elapsed > 0.0)
printf("%.0f threads per second, %.9f seconds per thread on average.\n",
(double)total.value / elapsed, elapsed / (double)total.value);
if (failed)
printf("Note: %lu threads increased the failed variable.\n", failed);
return 0;
}
I still get over 50000 threads per second, or 0.00002 seconds real time overhead per thread. (All threads will block on the locked mutex, and be canceled asynchronously. No thread will actually get to the failed++ line.)
Is that really a performance issue for you? If so, I think you have divided the work into too small units.
(By the way, my first example program grew a bit too complex; I have to rewrite it. )