
The performance ↔ productivity trade-off

Key takeaways

Fast to run vs fast to write: the core tension behind every language choice.
Productivity usually wins: developer time costs more than hardware, until it doesn’t.
Real-time and embedded are the exception: you can’t add servers to a Raspberry Pi.

If you had to compress language choice to one axis, this would be it. Some languages are fast to run — they squeeze the most out of the hardware. Others are fast to write — they let a developer get a working thing done quickly. You rarely get the maximum of both, and most of choosing a language is deciding which one your problem actually rewards. This lesson lays out the trade-off and, just as importantly, where it doesn’t apply.

The two kinds of “fast”

When people say a language is “fast,” they usually mean one of two different things:

Fast to run: low CPU and memory cost, predictable timing. C, C++ and Rust sit here, with Go close behind. All four compile ahead of time to native code and run with little overhead; C, C++ and Rust also give you direct control over memory, which Go trades for a garbage collector.
Fast to write: fewer lines, less ceremony, quick iteration, easy to read. Python, JavaScript and Ruby live here. Dynamic typing, garbage collection and large library ecosystems let you express a lot with a little, at some run-time cost.

fast to RUN                                          fast to WRITE
 C  C++  Rust   ────   Go   ────   Java  C#   ────   JS  ────  Python  Ruby
 (control, speed)      (a strong middle)              (productivity, speed of dev)

This is a spectrum, not two boxes. Go is the interesting middle: compiled and genuinely fast, yet simple and quick to write — bought partly by accepting garbage collection and, historically, fewer features (it lacked generics until 2022). Java and C# also occupy a managed middle, with big ecosystems and JIT-driven speed. There’s no free lunch — every position pays for what it gains.

Productivity often beats raw speed — because of economics

Here’s the uncomfortable truth for performance purists: for most software, developer time is the scarce resource, not CPU cycles.

Hardware is cheap and elastic. If a web service is slow, you can often add a bigger server or another instance for a few dollars a month — far less than the cost of a developer-month spent shaving milliseconds.
Most programs aren’t CPU-bound. They wait on the network, the disk or the database. Making the language faster does nothing for time spent waiting.
Slow software that ships beats fast software that doesn’t. A product that reaches users in a productive language wins over a faster one that’s still being hand-optimised six months later.

So the default bias for ordinary software is toward productivity — and that’s a rational default, not laziness. You reach for raw speed when measurement, not intuition, says you need it.
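One quick way to check whether a piece of code is CPU-bound or mostly waiting is to compare wall-clock time with CPU time. The sketch below uses only Python’s standard library; fetch_record and its 50 ms delay are hypothetical stand-ins for a real network or database call.

```python
import time


def fetch_record(delay_s: float = 0.05) -> int:
    """Stand-in for a network or database call: mostly waiting, little computing."""
    time.sleep(delay_s)           # simulated I/O wait (hypothetical delay)
    return sum(range(1_000))      # a small amount of real CPU work


def measure(func, *args):
    """Return (wall_seconds, cpu_seconds) for one call to func."""
    wall_start = time.perf_counter()   # wall-clock timer, includes waiting
    cpu_start = time.process_time()    # CPU timer, excludes time spent asleep
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


wall, cpu = measure(fetch_record)
cpu_share = cpu / wall
print(f"wall: {wall * 1000:.1f} ms, cpu: {cpu * 1000:.1f} ms, cpu share: {cpu_share:.0%}")
```

If the CPU share is small, as it should be here, a faster language would shave very little off the total, because most of the time is spent waiting rather than computing.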

“Make it work, make it right, make it fast”

This old maxim, usually attributed to Kent Beck, is the practical discipline behind the trade-off, and the order matters:

Make it work — get something correct and running, in whatever is quickest.
Make it right — clean it up, structure it, make it maintainable and tested.
Make it fast — only now, and only where you’ve measured a real bottleneck.

The trap it guards against is premature optimization — twisting code for speed before you know speed is even a problem. Premature optimization wastes effort, complicates code, and usually targets the wrong place anyway, because intuition about bottlenecks is notoriously bad. Profile first; optimise the proven hot path; leave the rest readable.
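“Profile first” can be as simple as running the standard library’s cProfile over the code path you suspect. The sketch below is illustrative only: parse, slow_dedupe and pipeline are made-up functions, chosen so that the expensive step is not the one that looks expensive at a glance.

```python
import cProfile
import io
import pstats


def parse(lines):
    """Cheap step: split each line into fields."""
    return [line.split(",") for line in lines]


def slow_dedupe(rows):
    """Accidentally quadratic: 'not in' scans the whole list on every row."""
    seen = []
    for row in rows:
        if row not in seen:
            seen.append(row)
    return seen


def pipeline(lines):
    return slow_dedupe(parse(lines))


# Hypothetical input: 4,000 lines containing 1,000 distinct rows.
lines = [f"{i % 1000},x" for i in range(4_000)]

profiler = cProfile.Profile()
profiler.enable()
result = pipeline(lines)
profiler.disable()

report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("tottime").print_stats(3)   # three costliest functions by own time
print(report.getvalue())
```

In a run like this the report should show slow_dedupe accounting for most of the time, which points at the fix (for instance, tracking seen rows in a set of tuples) and tells you that parse can stay as readable as it is.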

Productivity languages call into fast native code

The trade-off has a clever escape hatch that resolves much of the tension: a productive language can delegate the slow part to a fast one.

NumPy, pandas and SciPy give Python array math implemented in C and Fortran. Number-crunching expressed as whole-array operations runs at native speed while you write ordinary Python.
PyTorch and TensorFlow push tensor math onto optimised C++ and GPU kernels.
GNU Radio drives C++ DSP blocks from a Python flowgraph.

The pattern is a thin, productive layer over a fast core. This is why “Python is slow” coexists with “Python dominates data science” — the slow interpreter never touches the hot loop. When you can structure your problem this way, you often get most of the productivity and most of the speed. The catch: it only works when the expensive work can be handed off in big chunks. Fine-grained, per-sample logic that can’t be vectorised away gets no benefit, and pays full interpreter tax.
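The hand-off can be seen in miniature without installing anything. The sketch below assumes CPython, where map, operator.mul and sum are all implemented in C, so the second function pushes the whole loop into native code in one chunk. It is a stand-in for the idea, not a substitute for NumPy, which applies the same pattern to typed arrays with far larger gains. The function names and the sample size are arbitrary.

```python
import operator
import timeit

samples = list(range(200_000))


def sum_squares_loop(values):
    """Per-item logic in Python: the interpreter executes every iteration."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_delegated(values):
    """Same result, but map, operator.mul and sum all loop inside C code."""
    return sum(map(operator.mul, values, values))


# Both versions must agree before their speed is worth comparing.
assert sum_squares_loop(samples) == sum_squares_delegated(samples)

loop_s = timeit.timeit(lambda: sum_squares_loop(samples), number=5)
delegated_s = timeit.timeit(lambda: sum_squares_delegated(samples), number=5)
print(f"python loop: {loop_s:.3f} s, delegated: {delegated_s:.3f} s")
```

Timings depend on the machine and interpreter version, so measure rather than assume. The delegated version only exists because the work could be expressed as one bulk operation; logic that needs a Python-level decision on every sample has no such shortcut.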

Quick check: how does Python achieve good performance for data science despite being slow itself?

When you can’t just add servers

The “hardware is cheap, optimise later” logic has hard limits, and missing them is a classic mistake. Some problems can’t be solved by throwing more machines at them:

Real-time deadlines. A control loop or audio pipeline must finish each cycle on time. A second server doesn’t help a single thread that misses its deadline; you need the work to be fast, with predictable timing and no surprise pauses.
Embedded and constrained hardware. A microcontroller or a Raspberry Pi has the CPU and RAM it has. You cannot scale out — you must fit the budget you’re given.
Radio and DSP. Software-defined radio produces a relentless sample stream; each buffer must be filtered and demodulated before the next arrives or samples drop. That’s a per-buffer deadline on finite hardware — exactly where raw speed and predictability stop being optional. Python is excellent for prototyping DSP, but the production inner loop usually lives in C, C++, Rust or carefully written Go.

In these domains the trade-off flips: productivity is still nice, but performance is a hard constraint, and you choose a language that can meet it. GopherTrunk lives in this world; see the RF & SDR path for where the deadlines come from. The point is not “compiled good, interpreted bad”; it’s to know which side of the line your problem is on before you let economics pick for you.
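The per-buffer deadline is simple arithmetic: the time available to process one buffer is the number of samples in it divided by the sample rate. The sketch below works that out for made-up figures (a 2 MS/s stream in 16,384-sample buffers); the function names and the numbers are assumptions for illustration, not values from any particular radio.

```python
def buffer_deadline_s(buffer_samples: int, sample_rate_hz: float) -> float:
    """Time until the next buffer arrives: samples per buffer / samples per second."""
    return buffer_samples / sample_rate_hz


def meets_deadline(processing_s: float, buffer_samples: int, sample_rate_hz: float) -> bool:
    """True if one buffer is fully processed before the next one arrives."""
    return processing_s <= buffer_deadline_s(buffer_samples, sample_rate_hz)


# Hypothetical figures for illustration only.
SAMPLE_RATE_HZ = 2_000_000      # 2 MS/s
BUFFER_SAMPLES = 16_384

deadline = buffer_deadline_s(BUFFER_SAMPLES, SAMPLE_RATE_HZ)
print(f"deadline per buffer: {deadline * 1000:.3f} ms")          # 8.192 ms
print(meets_deadline(0.005, BUFFER_SAMPLES, SAMPLE_RATE_HZ))     # True: 5 ms fits
print(meets_deadline(0.012, BUFFER_SAMPLES, SAMPLE_RATE_HZ))     # False: 12 ms falls behind
```

With these figures the whole filter-and-demodulate chain gets about 8 ms per buffer, every buffer, with no slack for an occasional long pause. Adding a second machine changes nothing in this calculation, which is why the language and runtime have to meet the budget on their own.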

A balanced default

Put together, a sane default policy looks like this:

Bias to productivity for ordinary, I/O-bound, scalable software — and measure before optimising.
Bias to performance when you have a real, proven CPU-bound or real-time constraint on hardware you can’t simply grow.
Mix both when you can: a productive shell around a fast core.

Recap

Two kinds of fast: fast to run (C, C++, Rust) and fast to write (Python, JS, Ruby); Go and the managed languages sit in between.
Productivity is the usual default — developer time costs more than hardware, and most software isn’t CPU-bound.
Make it work, make it right, make it fast — optimise last, only where profiling proves it matters, to avoid premature optimization.
Productive languages delegate — NumPy and friends run the heavy work in native code, recovering much of the speed.
Real-time and embedded flip the rule — when you can’t add servers, performance becomes a hard constraint, as it is in radio DSP.

Next up: a tour of today’s major languages, a fair, even-handed survey of the ones you’ll actually choose between.

Frequently asked questions

Are compiled languages always the right choice for performance?

Only when performance is actually the binding constraint. For most software the bottleneck is the network, the database or developer time — not the language’s raw speed. Compiled languages shine when you have a genuine CPU-bound, latency-sensitive or resource-constrained problem, like real-time DSP on limited hardware.

If Python is slow, why is it so popular for data science and ML?

Because the slow part is delegated. Libraries like NumPy, pandas and PyTorch are thin Python interfaces over highly optimised C, C++, Fortran and GPU code. You write productive Python, but the heavy number-crunching runs in fast native code — so you get most of the productivity and most of the speed.

What does "premature optimization" actually mean?

Optimising code for speed before you know it matters — guessing at bottlenecks instead of measuring them. It wastes effort and complicates code that was fine as it was. The discipline is to make it work, make it right, then make it fast only where profiling proves it’s needed.