Gaurav Sarma

2026 update. The original version of this post compared three concurrency shapes by timing fixed batches of 1M / 10M / 100M / 1B operations and concluded "always pre-allocate." After re-running the experiment with Go's testing.B framework — and adding a variant that doesn't use a channel — the conclusion I drew was wrong, or at least imprecise. The code and numbers below are the corrected version. Original tables left at the bottom for reference. Source: github.com/gsarmaonline/golang-measurements.

Developers who learn Go are taught that goroutines are a very cheap version of threads. A new goroutine starts with a stack of roughly 2 kB (the minimum since Go 1.4; since Go 1.19 the runtime may choose a larger starting size based on average stack usage), and the standard library leans on this: net/http spawns a goroutine per accepted connection without apology. So is there ever a reason to not spawn one per request?

The original version of this post said yes: pre-allocate a pool of goroutines and reuse them, because it's "more than 2× faster." That number is real, but it's the answer to a different question than the one I was asking.

Why the original setup was misleading

The work unit looked like this:

func calculateSum(outputCh chan int, a, b int) {
    outputCh <- a + b
}

The function does an integer add and writes the result to a channel. The benchmark was: spawn N of these, time the wall clock. The problem: the channel send is a synchronization operation. Every iteration was paying for goroutine creation and a channel send + receive. When I compared "spawn per request" against "pool of 10 workers," I was comparing two shapes that both used channels — so I never saw how much of the gap was channels and how much was goroutines.

The second problem was using fixed N for timing. testing.B auto-tunes the iteration count until the run is long enough to measure reliably, and reports ns/op directly — no spreadsheet arithmetic, no "how do I divide 435,440ms by 1,000,000,000 again." It also lets you compare against a serial baseline trivially, which turns out to matter a lot here.
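A minimal sketch of the testing.B approach, covering the serial baseline and the original spawn-plus-channel shape. The function names, the sink variable and the use of testing.Benchmark from main (so the file runs with plain go run) are assumptions for illustration; the repository may structure its benchmarks differently.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the compiler from discarding the additions as dead code.
var sink int

// calculateSum is the work unit from the post.
func calculateSum(outputCh chan int, a, b int) {
	outputCh <- a + b
}

// benchSerial is the baseline: the add runs inline, with no goroutines.
func benchSerial(b *testing.B) {
	total := 0
	for i := 0; i < b.N; i++ {
		total += i + 1
	}
	sink = total
}

// benchSpawnPerRequestChan spawns one goroutine per iteration and reads
// each result back over an unbuffered channel.
func benchSpawnPerRequestChan(b *testing.B) {
	b.ReportAllocs()
	outputCh := make(chan int)
	total := 0
	for i := 0; i < b.N; i++ {
		go calculateSum(outputCh, i, 1)
		total += <-outputCh
	}
	sink = total
}

func main() {
	// Init registers the testing flags; it is needed when calling
	// testing.Benchmark outside of "go test".
	testing.Init()
	fmt.Println("serial:    ", testing.Benchmark(benchSerial))
	fmt.Println("spawn+chan:", testing.Benchmark(benchSpawnPerRequestChan))
}
```

In both cases the framework picks b.N, so the same function yields a per-op figure whether one iteration costs a fraction of a nanosecond or a few hundred. In a real package these would normally be BenchmarkXxx functions in a _test.go file run through go test -bench.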

The corrected setup

Five shapes, same a + b work unit, all driven by testing.B:

Shape	What it measures
`Serial`	Baseline. No goroutines.
`SpawnPerRequestChan`	One goroutine per call, result returned via channel. Original post's "approach 1."
`SpawnPerRequestWG`	One goroutine per call, sync via `sync.WaitGroup` + `atomic.AddUint64`. Channel bottleneck removed.
`PoolSharedChan_N`	Pre-allocated pool of N workers reading from one channel. Original post's "approach 2."
`PoolPerWorkerChan_N`	Pre-allocated pool, one channel per worker. Original post's "approach 3."

Full code at github.com/gsarmaonline/golang-measurements. Run with go test -bench=. -benchmem -benchtime=3s.
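For readers who don't want to open the repository, here is a rough sketch of three of the shapes as plain functions. It is a reading of the table above, not the repository's code: the function names, the job struct, and the choice to collect results with an atomic counter in the pool variants are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// job is a hypothetical work item: two operands to add.
type job struct{ a, b uint64 }

// spawnPerRequestWG starts one goroutine per work item and accumulates
// results with an atomic add, so no channel is involved.
func spawnPerRequestWG(n int) uint64 {
	var wg sync.WaitGroup
	var total uint64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(a, b uint64) {
			defer wg.Done()
			atomic.AddUint64(&total, a+b)
		}(uint64(i), 1)
	}
	wg.Wait()
	return total
}

// poolSharedChan starts a fixed number of long-lived workers that all
// receive from one shared channel.
func poolSharedChan(n, workers int) uint64 {
	jobs := make(chan job)
	var wg sync.WaitGroup
	var total uint64
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				atomic.AddUint64(&total, j.a+j.b)
			}
		}()
	}
	for i := 0; i < n; i++ {
		jobs <- job{uint64(i), 1}
	}
	close(jobs)
	wg.Wait()
	return total
}

// poolPerWorkerChan gives every worker its own channel and dispatches
// work round-robin, so workers do not contend on a single channel.
func poolPerWorkerChan(n, workers int) uint64 {
	chans := make([]chan job, workers)
	var wg sync.WaitGroup
	var total uint64
	for w := range chans {
		chans[w] = make(chan job)
		wg.Add(1)
		go func(in <-chan job) {
			defer wg.Done()
			for j := range in {
				atomic.AddUint64(&total, j.a+j.b)
			}
		}(chans[w])
	}
	for i := 0; i < n; i++ {
		chans[i%workers] <- job{uint64(i), 1}
	}
	for _, c := range chans {
		close(c)
	}
	wg.Wait()
	return total
}

func main() {
	const n = 1000
	fmt.Println(spawnPerRequestWG(n))
	fmt.Println(poolSharedChan(n, 10))
	fmt.Println(poolPerWorkerChan(n, 10))
}
```

All three compute the same sum; only the way work reaches a goroutine differs, which is the variable the benchmark is trying to isolate.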

The numbers

Apple M4, Go 1.24, benchtime=3s, median of three runs.

Shape	Pool size	ns/op	B/op	allocs/op
`Serial`	—	0.23	0	0
`SpawnPerRequestChan`	—	280	160	3
`SpawnPerRequestWG`	—	131	56	2
`PoolSharedChan`	10 (= GOMAXPROCS)	96	0	0
`PoolSharedChan`	100	193	0	0
`PoolSharedChan`	1000	244	0	0
`PoolPerWorkerChan`	100	74	0	0

A few things jump out:

The actual work is 0.23 ns. Everything above that is overhead. We are measuring scheduler and synchronization costs, not arithmetic.
Removing the channel from the no-pool case more than halves the cost. SpawnPerRequestChan is 280 ns; SpawnPerRequestWG is 131 ns. So in the original post, more than half of what I was calling "the cost of spawning a goroutine" (149 of 280 ns) was actually the cost of the channel send.
A 1000-worker pool (244 ns) is slower than no pool at all + atomics (131 ns). Pre-allocation isn't a free win.
The right-sized pool wins on allocations. PoolSharedChan_10 is 0 B/op, 0 allocs/op — the workers and channel exist for the lifetime of the benchmark. The spawn-per-request shapes allocate 56–160 B per iteration. For GC-sensitive workloads that's often a bigger deal than the ns/op.

What's actually going on

Pre-allocation isn't one thing — it's two things bundled together:

It amortizes goroutine creation across many work items.
It bounds the number of goroutines that exist concurrently.

The second one is doing most of the work. A 1000-goroutine pool gives you the first benefit (no spawn per request) without the second, and you can see it in the table: at 1000 workers the pool is barely better than spawning per request, and worse than the no-pool WaitGroup variant. The Go scheduler is multiplexing 1000 runnable goroutines onto ~10 OS threads, and that bookkeeping costs more than just doing the work serially across 10 long-lived workers.

The real rule, then, is: keep the concurrent goroutine count near GOMAXPROCS for CPU-bound work. Pre-allocation is one way to do that. So is a bounded semaphore over spawn-per-request. So is manually sharding requests across a fixed set of N workers. The strategy doesn't matter much; the count does.
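The "bounded semaphore over spawn-per-request" option can be sketched like this. It is an illustration of the bounding idea rather than a measured shape from the table: boundedSpawn, its parameters, and the peak counter are hypothetical names, and using a buffered channel as the semaphore reintroduces one channel operation per item, so do not read it as a speed claim.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// boundedSpawn still starts one goroutine per work item, but a buffered
// channel used as a counting semaphore caps how many exist at once.
// It returns the sum and the highest number of goroutines seen in flight.
func boundedSpawn(n, limit int) (total uint64, peak int64) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var inFlight int64
	for i := 0; i < n; i++ {
		sem <- struct{}{} // blocks while `limit` goroutines are in flight
		wg.Add(1)
		go func(a, b uint64) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			cur := atomic.AddInt64(&inFlight, 1)
			// Record the high-water mark with a compare-and-swap loop.
			for {
				p := atomic.LoadInt64(&peak)
				if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
					break
				}
			}
			atomic.AddUint64(&total, a+b)
			atomic.AddInt64(&inFlight, -1)
		}(uint64(i), 1)
	}
	wg.Wait()
	return total, peak
}

func main() {
	limit := runtime.GOMAXPROCS(0) // 0 reads the current setting
	total, peak := boundedSpawn(1000, limit)
	fmt.Printf("total=%d peak=%d limit=%d\n", total, peak, limit)
}
```

The peak value never exceeds the limit, because a slot is acquired before each goroutine starts and released only after it has finished its work.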

So why is fasthttp faster than net/http?

This is the question worth asking, because it's where the microbenchmark becomes practical. net/http does go c.serve(...) per accepted connection — exactly the pattern this post was nominally about. fasthttp uses a worker pool. So fasthttp must be winning because it pre-allocates goroutines, right?

A little, but not mostly. In rough order of impact:

fasthttp.RequestCtx is sync.Pool-recycled. Every request in net/http allocates a fresh *http.Request, a response writer, header maps, and parsed URL parts: 20+ heap objects. fasthttp hands one *RequestCtx back to a pool when the handler returns. This is the same effect you see in the benchmark above when PoolSharedChan_10 reports 0 B/op while SpawnPerRequestChan reports 160 B/op, just scaled up to a full request lifecycle.
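[]byte instead of string in the API. ctx.Path(), ctx.Method(), header values all return slices that point into buffers the context already owns. net/http's string-based API has to copy those bytes into freshly allocated strings while parsing each request, because a Go string is immutable and cannot safely alias a buffer that is about to be reused.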
Hand-rolled header parser, no textproto.MIMEHeader map allocation per request.
Worker pool for connections. This is the part the benchmark in this post is actually about. It's worth something (call it ~150 ns + 2 kB stack per goroutine you don't spawn), but it's the smallest of the four.

If you took net/http and swapped only the connection-handling for a worker pool, you'd recover maybe 10–15% of the gap to fasthttp. The other 85–90% is sync.Pool on everything reusable plus an API that doesn't force copies. The goroutine pool gets the headlines because it's visually obvious; the object pooling does the work.
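The object-pooling pattern itself is small. Here is a minimal sketch using sync.Pool; requestCtx, its fields, reset, and handle are hypothetical stand-ins for illustration and are not fasthttp's actual types or API.

```go
package main

import (
	"fmt"
	"sync"
)

// requestCtx is a hypothetical per-request object. It is NOT
// fasthttp.RequestCtx, only a stand-in with reusable buffers.
type requestCtx struct {
	path    []byte
	headers map[string]string
}

// reset clears per-request state but keeps the allocated capacity,
// which is what makes reuse cheaper than allocating again.
func (c *requestCtx) reset() {
	c.path = c.path[:0]
	for k := range c.headers {
		delete(c.headers, k)
	}
}

// ctxPool hands out recycled contexts and allocates only when empty.
var ctxPool = sync.Pool{
	New: func() any {
		return &requestCtx{
			path:    make([]byte, 0, 64),
			headers: make(map[string]string),
		}
	},
}

// handle simulates serving one request: borrow a context, fill it,
// use it, then reset it and return it to the pool.
func handle(path string) int {
	ctx := ctxPool.Get().(*requestCtx)
	defer func() {
		ctx.reset()
		ctxPool.Put(ctx)
	}()
	ctx.path = append(ctx.path, path...)
	ctx.headers["Host"] = "example.test"
	return len(ctx.path) + len(ctx.headers)
}

func main() {
	fmt.Println(handle("/a"))
	fmt.Println(handle("/abc"))
}
```

The catch with this design is lifetime: anything that still references ctx.path after the handler returns will see it overwritten by a later request, which is the trade-off a []byte-returning API accepts in exchange for avoiding copies.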

What I'd say now

Three things:

Measure with testing.B, not wall clocks on fixed N. It auto-scales and gives you per-op numbers that are directly comparable across shapes. Five minutes saved per experiment, plus you are far less likely to publish a result that depended on what else your laptop was doing that afternoon.
When you compare two concurrency shapes, hold the synchronization primitive constant. Otherwise you're measuring two things at once and you won't know which one moved.
Pre-allocation is the wrong frame; bounded concurrency is the right frame. And once you've bounded the concurrency, the next-biggest lever is almost always pooling the objects the goroutines touch, not the goroutines themselves.

Appendix: original numbers (Apple M1 Max Pro, 2023)

Kept for reference; the takeaways from these are partially correct (more workers eventually hurts) and partially misleading (pre-allocation framed as the cause of the win).

Spawn per request: 1B → 435,440 ms · 100M → 43,319 ms · 10M → 4,232 ms · 1M → 444 ms

Pool of 10 (shared channel): 1B → 168,123 ms · 100M → 16,644 ms · 10M → 1,738 ms · 1M → 167 ms

Pool of 100 (shared channel): 1B → 253,336 ms · 100M → 25,689 ms · 10M → 2,545 ms · 1M → 273 ms

Pool of 1000 (shared channel): 1B → 329,467 ms · 100M → 32,022 ms · 10M → 3,165 ms · 1M → 386 ms

Pool of 100 (per-worker channels): 100M → 23,016 ms · 10M → 2,138 ms · 1M → 228 ms

Pool of 1000 (per-worker channels): 100M → 27,504 ms · 10M → 2,920 ms · 1M → 310 ms
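To line these wall-clock totals up against the ns/op table above, divide each total by its operation count. A small sketch of that arithmetic for the 1B rows; nsPerOp is a hypothetical helper, and the inputs are the figures quoted in this appendix.

```go
package main

import "fmt"

// nsPerOp converts a wall-clock total in milliseconds, measured over
// n operations, into nanoseconds per operation.
func nsPerOp(totalMs, n float64) float64 {
	return totalMs * 1e6 / n // 1 ms = 1e6 ns
}

func main() {
	spawn := nsPerOp(435440, 1e9)  // spawn per request, 1B ops
	pool10 := nsPerOp(168123, 1e9) // pool of 10, shared channel, 1B ops
	fmt.Printf("spawn per request: %.0f ns/op\n", spawn)
	fmt.Printf("pool of 10:        %.0f ns/op\n", pool10)
	fmt.Printf("ratio:             %.2fx\n", spawn/pool10)
}
```

That works out to roughly 435 ns/op against 168 ns/op, a ratio of about 2.6, which is where the original "more than 2× faster" came from.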