2026 update. The original version of this post compared three concurrency shapes by timing fixed batches of 1M / 10M / 100M / 1B operations and concluded "always pre-allocate." After re-running the experiment with Go's testing.B framework, and adding a variant that doesn't use a channel, I found that the conclusion I drew was wrong, or at least imprecise. The code and numbers below are the corrected version. The original numbers are kept at the bottom for reference. Source: github.com/gsarmaonline/golang-measurements.
Developers who learn Go are taught that goroutines are a very cheap version of threads. The minimum stack is 2 kB as of Go 1.19, and the standard library leans on this — net/http spawns a goroutine per accepted connection without apology. So is there ever a reason to not spawn one per request?
The original version of this post said yes: pre-allocate a pool of goroutines and reuse them, because it's "more than 2× faster." That number is real, but it's the answer to a different question than the one I was asking.
Why the original setup was misleading
The work unit looked like this:
```go
func calculateSum(outputCh chan int, a, b int) {
	outputCh <- a + b
}
```
The function does an integer add and writes the result to a channel. The benchmark was: spawn N of these, time the wall clock. The problem: the channel send is a synchronization operation. Every iteration was paying for goroutine creation and a channel send + receive. When I compared "spawn per request" against "pool of 10 workers," I was comparing two shapes that both used channels — so I never saw how much of the gap was channels and how much was goroutines.
The second problem was using fixed N for timing. testing.B auto-tunes the iteration count until the run is long enough to measure reliably, and reports ns/op directly — no spreadsheet arithmetic, no "how do I divide 435,440ms by 1,000,000,000 again." It also lets you compare against a serial baseline trivially, which turns out to matter a lot here.
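To make the difference concrete, here is a minimal sketch of the serial baseline and the spawn-per-request shape driven by testing.B. It uses testing.Benchmark so it runs as an ordinary program; the repository presumably uses regular Benchmark functions under go test instead. The names benchSerial, benchSpawnPerRequestChan, and sink are illustrative, and receiving each result immediately after spawning is an assumption about how the original loop was structured.

```go
package main

import (
	"fmt"
	"testing"
)

// calculateSum is the work unit from the post: one add, one channel send.
func calculateSum(outputCh chan int, a, b int) {
	outputCh <- a + b
}

// sink keeps the compiler from discarding results (illustrative name).
var sink int

// benchSerial is the baseline: the same add with no goroutine and no channel.
func benchSerial(b *testing.B) {
	total := 0
	for i := 0; i < b.N; i++ {
		total += i + 1
	}
	sink = total
}

// benchSpawnPerRequestChan spawns one goroutine per operation and
// receives each result over an unbuffered channel.
func benchSpawnPerRequestChan(b *testing.B) {
	b.ReportAllocs()
	out := make(chan int)
	for i := 0; i < b.N; i++ {
		go calculateSum(out, i, 1)
		sink = <-out
	}
}

func main() {
	// testing.Benchmark chooses b.N itself, so no fixed batch size is needed.
	serial := testing.Benchmark(benchSerial)
	spawn := testing.Benchmark(benchSpawnPerRequestChan)
	fmt.Println("serial:    ", serial.String())
	fmt.Println("spawn+chan:", spawn.String(), spawn.MemString())
}
```

The printed figures depend on the machine, so none are shown here; the point is only that both shapes report ns/op on the same scale without any manual division.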
The corrected setup
Five shapes, same a + b work unit, all driven by testing.B:
| Shape | What it measures |
|---|---|
| Serial | Baseline. No goroutines. |
| SpawnPerRequestChan | One goroutine per call, result returned via channel. Original post's "approach 1." |
| SpawnPerRequestWG | One goroutine per call, sync via sync.WaitGroup + atomic.AddUint64. Channel bottleneck removed. |
| PoolSharedChan_N | Pre-allocated pool of N workers reading from one channel. Original post's "approach 2." |
| PoolPerWorkerChan_N | Pre-allocated pool, one channel per worker. Original post's "approach 3." |
Full code at github.com/gsarmaonline/golang-measurements. Run with go test -bench=. -benchmem -benchtime=3s.
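For readers who don't want to open the repository, the three non-trivial shapes can be sketched roughly as follows. This is not the repository's code: the job type, the function names, and the round-robin dispatch in the per-worker variant are assumptions made for illustration, and each function processes a fixed n rather than b.N so the result can be checked.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// job is a hypothetical request type carrying the two operands.
type job struct{ a, b int }

// spawnPerRequestWG starts one goroutine per item and accumulates results
// with an atomic add, so no channel is involved.
func spawnPerRequestWG(n int) uint64 {
	var total uint64
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func(a, b int) {
			defer wg.Done()
			atomic.AddUint64(&total, uint64(a+b))
		}(i, 1)
	}
	wg.Wait()
	return total
}

// poolSharedChan starts `workers` long-lived goroutines that all read
// from one shared jobs channel.
func poolSharedChan(n, workers int) uint64 {
	var total uint64
	var wg sync.WaitGroup
	jobs := make(chan job)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				atomic.AddUint64(&total, uint64(j.a+j.b))
			}
		}()
	}
	for i := 0; i < n; i++ {
		jobs <- job{i, 1}
	}
	close(jobs)
	wg.Wait()
	return total
}

// poolPerWorkerChan gives each worker its own channel and dispatches
// jobs round-robin (the dispatch policy is an assumption).
func poolPerWorkerChan(n, workers int) uint64 {
	var total uint64
	var wg sync.WaitGroup
	chans := make([]chan job, workers)
	for w := range chans {
		chans[w] = make(chan job)
		wg.Add(1)
		go func(in <-chan job) {
			defer wg.Done()
			for j := range in {
				atomic.AddUint64(&total, uint64(j.a+j.b))
			}
		}(chans[w])
	}
	for i := 0; i < n; i++ {
		chans[i%workers] <- job{i, 1}
	}
	for _, c := range chans {
		close(c)
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(spawnPerRequestWG(1000), poolSharedChan(1000, 10), poolPerWorkerChan(1000, 100))
}
```

All three compute the same sum, which makes it easy to confirm that a shape is doing the full amount of work before trusting its timing.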
The numbers
Apple M4, Go 1.24, benchtime=3s, median of three runs.
| Shape | Pool size | ns/op | B/op | allocs/op |
|---|---|---|---|---|
| Serial | n/a | 0.23 | 0 | 0 |
| SpawnPerRequestChan | n/a | 280 | 160 | 3 |
| SpawnPerRequestWG | n/a | 131 | 56 | 2 |
| PoolSharedChan | 10 (= GOMAXPROCS) | 96 | 0 | 0 |
| PoolSharedChan | 100 | 193 | 0 | 0 |
| PoolSharedChan | 1000 | 244 | 0 | 0 |
| PoolPerWorkerChan | 100 | 74 | 0 | 0 |
A few things jump out:
- The actual work is 0.23 ns. Everything above that is overhead. We are measuring scheduler and synchronization costs, not arithmetic.
- Removing the channel from the no-pool case more than halves the cost. SpawnPerRequestChan is 280 ns; SpawnPerRequestWG is 131 ns. So in the original post, more than half of what I was calling "the cost of spawning a goroutine" was actually the cost of the channel send.
- A 1000-worker pool (244 ns) is slower than no pool at all + atomics (131 ns). Pre-allocation isn't a free win.
- The right-sized pool wins on allocations. PoolSharedChan_10 is 0 B/op, 0 allocs/op: the workers and channel exist for the lifetime of the benchmark. The spawn-per-request shapes allocate 56–160 B per iteration. For GC-sensitive workloads that's often a bigger deal than the ns/op.
What's actually going on
Pre-allocation isn't one thing — it's two things bundled together:
- It amortizes goroutine creation across many work items.
- It bounds the number of goroutines that exist concurrently.
The second one is doing most of the work. A 1000-goroutine pool gives you the first benefit (no spawn per request) without the second, and you can see it in the table: at 1000 workers the pool is barely better than spawning per request, and worse than the no-pool WaitGroup variant. The Go scheduler is multiplexing 1000 runnable goroutines onto ~10 OS threads, and that bookkeeping costs more than just doing the work serially across 10 long-lived workers.
The real rule, then, is: keep the concurrent goroutine count near GOMAXPROCS for CPU-bound work. Pre-allocation is one way to do that. So is a bounded semaphore over spawn-per-request. So is dispatching N requests across N workers manually. The strategy doesn't matter much; the count does.
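As one illustration of "a bounded semaphore over spawn-per-request," here is a minimal sketch that still spawns a goroutine per item but caps how many exist at once with a buffered channel. The function name boundedSpawn and the peak-tracking counter are hypothetical additions, included only so the bound can be observed; a real workload would drop the counter.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// boundedSpawn runs n additions with one goroutine each, but never lets
// more than `limit` of them exist at once. It returns the sum and the
// highest concurrency it observed.
func boundedSpawn(n, limit int) (sum uint64, peak int64) {
	sem := make(chan struct{}, limit) // buffered channel as counting semaphore
	var wg sync.WaitGroup
	var inFlight int64
	for i := 0; i < n; i++ {
		sem <- struct{}{} // blocks while `limit` goroutines hold a slot
		wg.Add(1)
		go func(a, b int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the work is done
			cur := atomic.AddInt64(&inFlight, 1)
			// Record the highest in-flight count seen so far.
			for {
				p := atomic.LoadInt64(&peak)
				if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
					break
				}
			}
			atomic.AddUint64(&sum, uint64(a+b))
			atomic.AddInt64(&inFlight, -1)
		}(i, 1)
	}
	wg.Wait()
	return sum, peak
}

func main() {
	limit := runtime.GOMAXPROCS(0) // query the current value without changing it
	sum, peak := boundedSpawn(1000, limit)
	fmt.Println("sum:", sum, "peak goroutines:", peak, "limit:", limit)
}
```

This keeps the per-request goroutine (and its allocation) but bounds the count, which is the half of pre-allocation that the table suggests matters most.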
So why is fasthttp faster than net/http?
This is the question worth asking, because it's where the microbenchmark becomes practical. net/http does go c.serve(...) per accepted connection — exactly the pattern this post was nominally about. fasthttp uses a worker pool. So fasthttp must be winning because it pre-allocates goroutines, right?
A little, but not mostly. In rough order of impact:
1. fasthttp.RequestCtx is sync.Pool-recycled. Every request in net/http allocates a fresh *http.Request, response writer, headers map, and parsed URL parts: 20+ heap objects. fasthttp hands one *RequestCtx back to a pool when the handler returns. This is the same effect you see in the benchmark above when PoolSharedChan_10 reports 0 B/op while SpawnPerRequestChan reports 160 B/op, just scaled up to a full request lifecycle.
2. []byte instead of string in the API. ctx.Path(), ctx.Method(), and header values all return slices that point into the read buffer. net/http's string-based API forces a copy out of the read buffer for each value it exposes, because strings are immutable.
3. Hand-rolled header parser, no textproto.MIMEHeader map allocation per request.
4. Worker pool for connections. This is the part the benchmark in this post is actually about. It's worth something (call it ~150 ns + 2 kB stack per request you don't spawn), but it's the smallest of the four.
If you took net/http and swapped only the connection-handling for a worker pool, you'd recover maybe 10–15% of the gap to fasthttp. The other 85–90% is sync.Pool on everything reusable plus an API that doesn't force copies. The goroutine pool gets the headlines because it's visually obvious; the object pooling does the work.
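The object-pooling pattern itself is small. The sketch below shows the general sync.Pool shape with a hypothetical requestCtx type; it is not fasthttp's RequestCtx or its internals, only the borrow, reset, return cycle that the first item in the list describes.

```go
package main

import (
	"fmt"
	"sync"
)

// requestCtx is a hypothetical stand-in for a per-request object.
// It is not fasthttp's actual RequestCtx.
type requestCtx struct {
	path []byte
	body []byte
}

// reset clears the contents but keeps the backing arrays for reuse.
func (c *requestCtx) reset() {
	c.path = c.path[:0]
	c.body = c.body[:0]
}

// ctxPool hands out recycled contexts and allocates only when empty.
var ctxPool = sync.Pool{
	New: func() any { return &requestCtx{} },
}

// handle borrows a context, fills it from the raw input, builds a reply,
// and gives the context back to the pool before returning.
func handle(rawPath, rawBody string) string {
	ctx := ctxPool.Get().(*requestCtx)
	defer func() {
		ctx.reset()
		ctxPool.Put(ctx)
	}()
	ctx.path = append(ctx.path, rawPath...)
	ctx.body = append(ctx.body, rawBody...)
	return fmt.Sprintf("%s: %d bytes", ctx.path, len(ctx.body))
}

func main() {
	fmt.Println(handle("/ping", "hello")) // prints "/ping: 5 bytes"
	fmt.Println(handle("/echo", "hi"))    // prints "/echo: 2 bytes"
}
```

sync.Pool makes no promise that a given object will be reused, so correctness has to come from the reset, not from assuming a fresh or a recycled value.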
What I'd say now
Three things:
- Measure with testing.B, not wall clocks on fixed N. It auto-scales and gives you per-op numbers that are directly comparable across shapes. Five minutes saved per experiment, plus you can't accidentally publish a result that depended on what else your laptop was doing that afternoon.
- When you compare two concurrency shapes, hold the synchronization primitive constant. Otherwise you're measuring two things at once and you won't know which one moved.
- Pre-allocation is the wrong frame; bounded concurrency is the right frame. And once you've bounded the concurrency, the next-biggest lever is almost always pooling the objects the goroutines touch, not the goroutines themselves.
Appendix: original numbers (Apple M1 Max Pro, 2023)
Kept for reference; the takeaways from these are partially correct (more workers eventually hurts) and partially misleading (pre-allocation framed as the cause of the win).
Spawn per request: 1B → 435,440 ms · 100M → 43,319 ms · 10M → 4,232 ms · 1M → 444 ms
Pool of 10 (shared channel): 1B → 168,123 ms · 100M → 16,644 ms · 10M → 1,738 ms · 1M → 167 ms
Pool of 100 (shared channel): 1B → 253,336 ms · 100M → 25,689 ms · 10M → 2,545 ms · 1M → 273 ms
Pool of 1000 (shared channel): 1B → 329,467 ms · 100M → 32,022 ms · 10M → 3,165 ms · 1M → 386 ms
Pool of 100 (per-worker channels): 100M → 23,016 ms · 10M → 2,138 ms · 1M → 228 ms
Pool of 1000 (per-worker channels): 100M → 27,504 ms · 10M → 2,920 ms · 1M → 310 ms