Should You Multithread?

An Experiment-Driven Approach

Multi-threading is not automatically faster. Creating and joining threads has measurable overhead on the order of 100 µs per thread on typical hardware — enough to swamp a short computation entirely. Using matrix multiplication as a test case, we measure this overhead directly with pthreads, derive a closed-form crossover condition, and build a "smart" multiplier that automatically selects the faster path. An interactive calculator lets you find your own machine's crossover point.

← Back to blog

The Assumption

When you hear "multithreading," you probably think "faster." The logic seems airtight: more CPU cores running in parallel should finish sooner. But there is a hidden cost — the operating system has to create a thread, schedule it, and then join it back. For short tasks, this overhead dominates and the parallel version ends up slower than the serial one.

This article answers the question quantitatively. We will implement matrix multiplication in C, naively parallelize it, watch the performance collapse, measure the overhead, derive the exact break-even condition, and build something smarter.

Prerequisites: C programming, basic linear algebra, a POSIX system with pthreads, and a C compiler.

Implementing Matrix Multiplication

We represent a matrix as a struct with row/column counts and a flat double* array stored in row-major order. Using a 1-D array instead of double** needs only a single allocation and keeps the indexing arithmetic simple. A double** requires N+1 separate malloc calls — one for the pointer array and one per row — and produces fragmented memory. A flat double* does it in one call. Row-major indexing is just i * n_cols + j.

typedef struct {
    int     n_rows;
    int     n_cols;
    double *data;
} Matrix;

Matrix *mat_create(int n_rows, int n_cols) {
    Matrix *m  = malloc(sizeof(Matrix));
    m->n_rows  = n_rows;
    m->n_cols  = n_cols;
    m->data    = malloc(n_rows * n_cols * sizeof(double));
    return m;
}

void mat_destroy(Matrix *m) {
    free(m->data);
    free(m);
}

double mat_get(Matrix *m, int i, int j) { return m->data[i * m->n_cols + j]; }
void   mat_set(Matrix *m, int i, int j, double v) { m->data[i * m->n_cols + j] = v; }

Matrix multiplication C = AB requires computing c_{ij} = \sum_{k=1}^{K} a_{ik}\, b_{kj} for each entry, where K is the shared (inner) dimension. This is O(K) per entry and O(MNK) overall for an M \times K times K \times N product.

We factor out the inner dot product into its own function. That decision will pay off when we parallelize:

/* Compute C[i][j] = dot( A[i,:], B[:,j] ) */
double matmul_subroutine(Matrix *A, Matrix *B, int i, int j) {
    double sum = 0.0;
    for (int k = 0; k < A->n_cols; k++)
        sum += mat_get(A, i, k) * mat_get(B, k, j);
    return sum;
}

Matrix *matmul(Matrix *A, Matrix *B) {
    Matrix *C = mat_create(A->n_rows, B->n_cols);
    for (int i = 0; i < A->n_rows; i++)
        for (int j = 0; j < B->n_cols; j++)
            mat_set(C, i, j, matmul_subroutine(A, B, i, j));
    return C;
}

Naïve Multithreading: One Thread per Dot Product

The parallel structure is obvious: every one of the S = M \times N dot products is independent. Spawn one thread per entry, join them all at the end.

#include <pthread.h>

typedef struct { Matrix *A, *B, *C; int i, j; } ThreadParams;

void *thread_fn(void *args) {
    ThreadParams *p = (ThreadParams *)args;
    mat_set(p->C, p->i, p->j, matmul_subroutine(p->A, p->B, p->i, p->j));
    return NULL;
}

Matrix *matmul_parallel_naive(Matrix *A, Matrix *B) {
    int S = A->n_rows * B->n_cols;
    Matrix    *C      = mat_create(A->n_rows, B->n_cols);
    pthread_t  threads[S];
    ThreadParams params[S];

    int t = 0;
    for (int i = 0; i < A->n_rows; i++)
        for (int j = 0; j < B->n_cols; j++) {
            params[t] = (ThreadParams){A, B, C, i, j};
            pthread_create(&threads[t], NULL, thread_fn, &params[t]);
            t++;
        }
    for (int i = 0; i < S; i++)
        pthread_join(threads[i], NULL);
    return C;
}

Benchmark on a 100 × 100 matrix product (10,000 dot products):

Implementation	Time (s)	Relative
`matmul` (serial)	0.000172	1×
`matmul_parallel_naive`	0.003134	18× slower

The parallel version is eighteen times slower. The threads are not helping — they are costing us. The computation is so short that the overhead of spawning and joining 10,000 threads dwarfs the actual work.

Measuring Thread Overhead

To confirm the hypothesis, we measure the cost of a pthread_create + pthread_join pair directly, using clock_gettime for wall-clock time. clock() measures CPU time, which accumulates across all threads and would give a misleadingly large number in a multi-threaded context. clock_gettime(CLOCK_MONOTONIC, ...) gives wall-clock elapsed time, which is what the user actually waits for.

#include <time.h>

void *noop(void *arg) { return NULL; }

double measure_thread_overhead(void) {
    struct timespec t0, t1;
    const int N = 1000;
    pthread_t thread;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        pthread_create(&thread, NULL, noop, NULL);
        pthread_join(thread, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec  - t0.tv_sec) +
                     (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return elapsed / N;   /* amortized cost per create+join */
}

On the test machine this yields h \approx 0.0001 seconds (100 µs) per thread. For a 100 × 100 product that spawns 10,000 threads, the overhead alone is roughly one second — far more than the 172 µs serial time.

The Break-Even Condition

Let's be precise. Define:

S = M \times N — total number of dot products
T — time for a single matmul_subroutine call
C — number of CPU cores
h — pthread overhead per thread (create + join)

Assuming perfect load balancing across cores:

T_{\text{serial}} = S \cdot T

T_{\text{parallel}} = \frac{S \cdot T}{C} + S \cdot h

Setting T_{\text{parallel}} \lt T_{\text{serial}} and solving for T:

\frac{T}{C} + h \lt T \;\Longrightarrow\; h \lt T\!\left(1 - \tfrac{1}{C}\right) \;\Longrightarrow\; \boxed{T^* = \frac{h \cdot C}{C - 1}}

Notice that S cancels — matrix size does not determine whether threading pays off. The only thing that matters is how long a single dot product takes, which depends solely on the inner dimension K.

On a 4-core machine with h = 10^{-4} s: T^* = 10^{-4} \cdot 4/3 \approx 1.33 \times 10^{-4} s. Threading only wins when a single dot product takes longer than 133 µs.

Measuring the Subroutine

We benchmark matmul_subroutine across inner dimensions to find where T crosses T^*:

Inner dim n	Time T (s)	Regime
100	0.0000010	serial wins by 133×
1,000	0.0000040	serial wins by 33×
10,000	0.0000350	serial wins by 3.8×
~27,000	~0.000133	↔ crossover
100,000	0.0004900	parallel wins
1,000,000	0.0038400	parallel wins
10,000,000	0.0313800	parallel wins
100,000,000	0.3031080	parallel wins

The scaling is linear in n as expected, confirming O(n) behavior. The empirical crossover sits near n \approx 27{,}000.

Execution time of a single matmul_subroutine call vs. inner dimension n, plotted on log-log axes. The red dashed line marks T^* — threading pays off to the right. Adjust the sliders below to move the threshold for your hardware.

Interactive: Find Your Crossover Point

Measure h on your machine using the overhead measurement code above, then tune the slider to see where threading starts to pay off.

The empirical crossover (~27,000) agrees reasonably with the theoretical prediction. The gap arises from our simplified overhead model — cache effects, scheduler latency, and contention on thread-local storage also contribute to h in practice.

A Smarter Matrix Multiply

Now we can build an adaptive implementation. Check the inner dimension at runtime and dispatch accordingly:

/* Empirically tuned for this machine; re-measure h and recompute for yours. */
#define CROSSOVER_N 27000

Matrix *matmul_smart(Matrix *A, Matrix *B) {
    if (A->n_cols < CROSSOVER_N)
        return matmul(A, B);                /* serial: overhead would dominate */
    else
        return matmul_parallel_naive(A, B); /* parallel: computation dominates */
}

Benchmarked across 42 matrix configurations with increasing inner dimension, each averaged over 30 runs: matmul_smart matches serial performance for small n and tracks the parallel speedup for large n, staying below the extrapolated serial curve throughout.

Takeaway

Multi-threading is not always the right tool. Thread creation and joining introduce overhead on the order of 100 µs per thread on typical hardware. The right question when parallelizing a routine is not "can I thread this?" but "does the per-unit computation time exceed hC/(C-1)?"

Measure your overhead, compute the threshold, hard-code the constant, and let the code decide.