An Experiment-Driven Approach
Multi-threading is not automatically faster. Creating and joining threads has measurable
overhead on the order of 100 µs per thread on typical hardware — enough to swamp a short
computation entirely. Using matrix multiplication as a test case, we measure this overhead
directly with pthreads, derive a closed-form crossover condition, and build
a "smart" multiplier that automatically selects the faster path. An interactive calculator
lets you find your own machine's crossover point.
When you hear "multithreading," you probably think "faster." The logic seems airtight: more CPU cores running in parallel should finish sooner. But there is a hidden cost — the operating system has to create a thread, schedule it, and then join it back. For short tasks, this overhead dominates and the parallel version ends up slower than the serial one.
This article answers the question quantitatively. We will implement matrix multiplication in C, naively parallelize it, watch the performance collapse, measure the overhead, derive the exact break-even condition, and build something smarter.
Prerequisites: C programming, basic linear algebra, a POSIX system
with pthreads, and a C compiler.
We represent a matrix as a struct with row/column counts and a flat double*
array stored in row-major order. Using a 1-D array instead of double** needs
only a single allocation and keeps the indexing arithmetic simple.
double** requires N+1 separate malloc calls — one
for the pointer array and one per row — and produces fragmented memory. A flat
double* does it in one call. Row-major indexing is just
i * n_cols + j.
typedef struct {
int n_rows;
int n_cols;
double *data;
} Matrix;
Matrix *mat_create(int n_rows, int n_cols) {
Matrix *m = malloc(sizeof(Matrix));
m->n_rows = n_rows;
m->n_cols = n_cols;
m->data = malloc(n_rows * n_cols * sizeof(double));
return m;
}
void mat_destroy(Matrix *m) {
free(m->data);
free(m);
}
double mat_get(Matrix *m, int i, int j) { return m->data[i * m->n_cols + j]; }
void mat_set(Matrix *m, int i, int j, double v) { m->data[i * m->n_cols + j] = v; }
Matrix multiplication
We factor out the inner dot product into its own function. That decision will pay off when we parallelize:
/* Compute C[i][j] = dot( A[i,:], B[:,j] ) */
double matmul_subroutine(Matrix *A, Matrix *B, int i, int j) {
double sum = 0.0;
for (int k = 0; k < A->n_cols; k++)
sum += mat_get(A, i, k) * mat_get(B, k, j);
return sum;
}
Matrix *matmul(Matrix *A, Matrix *B) {
Matrix *C = mat_create(A->n_rows, B->n_cols);
for (int i = 0; i < A->n_rows; i++)
for (int j = 0; j < B->n_cols; j++)
mat_set(C, i, j, matmul_subroutine(A, B, i, j));
return C;
}
The parallel structure is obvious: every one of the
#include <pthread.h>
typedef struct { Matrix *A, *B, *C; int i, j; } ThreadParams;
void *thread_fn(void *args) {
ThreadParams *p = (ThreadParams *)args;
mat_set(p->C, p->i, p->j, matmul_subroutine(p->A, p->B, p->i, p->j));
return NULL;
}
Matrix *matmul_parallel_naive(Matrix *A, Matrix *B) {
int S = A->n_rows * B->n_cols;
Matrix *C = mat_create(A->n_rows, B->n_cols);
pthread_t threads[S];
ThreadParams params[S];
int t = 0;
for (int i = 0; i < A->n_rows; i++)
for (int j = 0; j < B->n_cols; j++) {
params[t] = (ThreadParams){A, B, C, i, j};
pthread_create(&threads[t], NULL, thread_fn, ¶ms[t]);
t++;
}
for (int i = 0; i < S; i++)
pthread_join(threads[i], NULL);
return C;
}
Benchmark on a 100 × 100 matrix product (10,000 dot products):
| Implementation | Time (s) | Relative |
|---|---|---|
matmul (serial) |
0.000172 | 1× |
matmul_parallel_naive |
0.003134 | 18× slower |
The parallel version is eighteen times slower. The threads are not helping — they are costing us. The computation is so short that the overhead of spawning and joining 10,000 threads dwarfs the actual work.
To confirm the hypothesis, we measure the cost of a pthread_create +
pthread_join pair directly, using clock_gettime for
wall-clock time.
clock() measures CPU time, which accumulates across all threads and
would give a misleadingly large number in a multi-threaded context.
clock_gettime(CLOCK_MONOTONIC, ...) gives wall-clock elapsed time,
which is what the user actually waits for.
#include <time.h>
void *noop(void *arg) { return NULL; }
double measure_thread_overhead(void) {
struct timespec t0, t1;
const int N = 1000;
pthread_t thread;
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int i = 0; i < N; i++) {
pthread_create(&thread, NULL, noop, NULL);
pthread_join(thread, NULL);
}
clock_gettime(CLOCK_MONOTONIC, &t1);
double elapsed = (t1.tv_sec - t0.tv_sec) +
(t1.tv_nsec - t0.tv_nsec) * 1e-9;
return elapsed / N; /* amortized cost per create+join */
}
On the test machine this yields
Let's be precise. Define:
matmul_subroutine callAssuming perfect load balancing across cores:
Setting
Notice that
On a 4-core machine with
We benchmark matmul_subroutine across inner dimensions to find where
| Inner dim |
Time |
Regime |
|---|---|---|
| 100 | 0.0000010 | serial wins by 133× |
| 1,000 | 0.0000040 | serial wins by 33× |
| 10,000 | 0.0000350 | serial wins by 3.8× |
| ~27,000 | ~0.000133 | ↔ crossover |
| 100,000 | 0.0004900 | parallel wins |
| 1,000,000 | 0.0038400 | parallel wins |
| 10,000,000 | 0.0313800 | parallel wins |
| 100,000,000 | 0.3031080 | parallel wins |
The scaling is linear in
matmul_subroutine call vs. inner dimension
Measure
The empirical crossover (~27,000) agrees reasonably with the theoretical prediction.
The gap arises from our simplified overhead model — cache effects, scheduler
latency, and contention on thread-local storage also contribute to
Now we can build an adaptive implementation. Check the inner dimension at runtime and dispatch accordingly:
/* Empirically tuned for this machine; re-measure h and recompute for yours. */
#define CROSSOVER_N 27000
Matrix *matmul_smart(Matrix *A, Matrix *B) {
if (A->n_cols < CROSSOVER_N)
return matmul(A, B); /* serial: overhead would dominate */
else
return matmul_parallel_naive(A, B); /* parallel: computation dominates */
}
Benchmarked across 42 matrix configurations with increasing inner dimension, each
averaged over 30 runs: matmul_smart matches serial performance for small
Multi-threading is not always the right tool. Thread creation and joining introduce
overhead on the order of 100 µs per thread on typical hardware. The right question
when parallelizing a routine is not "can I thread this?" but
"does the per-unit computation time exceed
Measure your overhead, compute the threshold, hard-code the constant, and let the code decide.