

Most production engineering happens at the edges

After fifteen years of shipping software for other people, I've learned that the parts that turn an engagement from "works" into "works in production" rarely look like features. They look like edges.

Erfan Besharat · 30 Apr 2026 · 6 min read · Engineering · Production · Reliability

Every engagement starts the same way. There’s a feature spec, or a vision deck, or a wireframe Mojtaba and the design lead put together in a week. The visible work is clear. Build the page. Ship the endpoint. Wire the integration.

Then we get into it, and the actual engineering work shifts to somewhere the spec doesn’t describe. Not the feature: the conditions around the feature. What happens when the upstream API rate-limits us for three minutes. What the dashboard says when the customer’s Stripe webhook arrives twice. Whether the cache invalidation is correct when an editor saves a draft, navigates away, and an admin publishes the same record.

These are the edges. Most production engineering happens here.
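
One of those conditions, made concrete: the Stripe webhook that arrives twice. A minimal TypeScript sketch of what handling that edge looks like, where the event shape, the in-memory seenEventIds set, and the invoice.paid branch are illustrative placeholders (a real handler would verify the provider’s signature and record processed ids in a durable store):

```typescript
// Minimal sketch of idempotent webhook handling. In a real system the "seen"
// set would be a durable store keyed by event id, and the payload would be
// signature-verified before use; both are elided here.
type WebhookEvent = { id: string; type: string; data: unknown };

const seenEventIds = new Set<string>(); // stand-in for a durable store

async function handleWebhook(event: WebhookEvent): Promise<string> {
  // Providers retry deliveries, so the same event can arrive more than once.
  // Keying the work off the event id makes a second delivery a no-op.
  if (seenEventIds.has(event.id)) {
    return "duplicate delivery: already processed";
  }
  seenEventIds.add(event.id);

  switch (event.type) {
    case "invoice.paid":
      // ...apply the side effect exactly once (update the dashboard, etc.)...
      return "processed invoice.paid";
    default:
      return `ignored ${event.type}`;
  }
}

// A second delivery of the same event changes nothing:
await handleWebhook({ id: "evt_123", type: "invoice.paid", data: {} });
await handleWebhook({ id: "evt_123", type: "invoice.paid", data: {} });
```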

What I mean by an edge

An edge is a place where the system meets a condition the feature spec didn’t describe. Some are external: a vendor outage, a slow network, clock skew. Some are internal: a race between two writers, a partial failure halfway through a transaction, a queue that backs up faster than the worker drains it.

Edges are not bugs. Bugs are mistakes inside features. Edges are the space between features, between systems, between this code and the world. Every production system has them; the question is whether you’ve named them, decided how to handle them, and put the decision in code where the next engineer can see it.

Why edges matter more than features

A feature that works in the demo is table stakes. A feature that works in production, on the worst day, is the differentiator. The customers who pay you don’t experience your feature list. They experience whether the page loads when their AppleScript bot is hammering it. Whether the email arrives even when SendGrid is having a bad afternoon. Whether the export button fails loudly or quietly.

Most of the engineering work that ends up making a customer pay you for the next year is invisible from the feature list. It’s in how your system fails, how it recovers, how it tells you it failed, and how clean the cleanup is afterwards.

The edges I’ve seen pay for themselves

These are concrete patterns we’ve put into production over the years, and every time they’ve made the difference between an engagement that’s easy to support and one that isn’t:

  • Idempotency keys on every external call. If you can’t safely retry a request, you don’t actually own the integration; the vendor does. (Sketched, together with the next item, after this list.)
  • Explicit timeouts everywhere. Defaults are usually fine until they’re catastrophic. The day a vendor stops responding instead of failing fast is the day your default timeouts decide whether your service stays up.
  • Failure-aware UI. Loading states that don’t spin forever, error states that suggest a next action, optimistic updates that roll back when the server says no. Most production-feeling UI is just this.
  • Observability from day one. Not a dashboard-driven-development cult, but enough structured logging, tracing, and SLOs that on day 30 of an engagement you can answer “is this slow because of us, or because of them?” (Also sketched after this list.)
  • Cache invalidation that someone has actually thought about. Not the first thing that worked. The pattern that handles staleness, partial writes, and concurrent updates by design.
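
The first two items are small enough to show. A TypeScript sketch, where the vendor URL, the five-second budget, the retry policy, and the assumption that the vendor honours an Idempotency-Key header (Stripe’s API does; check yours) are all illustrative rather than a specific contract:

```typescript
// Sketch of an outbound vendor call with an explicit timeout and an
// idempotency key. The URL, the 5-second budget, and the Idempotency-Key
// header are illustrative; adjust to the integration at hand.
async function createCharge(payload: object, idempotencyKey: string) {
  const response = await fetch("https://api.example-vendor.com/charges", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // The same key is sent on every retry of the same logical operation,
      // so the vendor can detect replays and we can retry without double-charging.
      "Idempotency-Key": idempotencyKey,
    },
    body: JSON.stringify(payload),
    // Fail in five seconds instead of inheriting whatever the platform default is.
    signal: AbortSignal.timeout(5_000),
  });
  if (!response.ok) throw new Error(`vendor returned ${response.status}`);
  return response.json();
}

// The caller generates the key once per logical operation and reuses it on retry.
async function chargeWithRetry(payload: object, attempts = 3) {
  const key = crypto.randomUUID();
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await createCharge(payload, key);
    } catch (err) {
      if (attempt === attempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, 500 * attempt)); // crude backoff
    }
  }
}
```

The retry is only safe because the key makes the operation replayable; the particular backoff is incidental.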

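The observability item is the same kind of small. A sketch of the minimum that tends to earn its keep: wrap every outbound dependency call so it emits a structured log line with the dependency name, duration, and outcome. The log shape and the “sendgrid” label are illustrative, not a particular vendor’s schema:

```typescript
// Wrap outbound calls so each one emits a structured log line with the
// dependency name, duration, and outcome. With this in place from day one,
// "is it slow because of us or because of them?" is a log query, not a debate.
async function withDependencyTiming<T>(
  dependency: string,
  operation: string,
  call: () => Promise<T>
): Promise<T> {
  const startedAt = Date.now();
  try {
    const result = await call();
    console.log(JSON.stringify({
      level: "info", dependency, operation,
      outcome: "ok", durationMs: Date.now() - startedAt,
    }));
    return result;
  } catch (err) {
    console.log(JSON.stringify({
      level: "error", dependency, operation,
      outcome: "error", durationMs: Date.now() - startedAt,
      message: err instanceof Error ? err.message : String(err),
    }));
    throw err;
  }
}

// Usage (sendEmail is a placeholder for your own client call):
// await withDependencyTiming("sendgrid", "send-email", () => sendEmail(message));
```
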
Why the spec leaves them out

Feature specs are written by people thinking about the happy path because that’s the path they can describe. Failure modes are combinatorial, and combinatorics resists documentation. So the spec ships a beautiful picture of how the thing works, and the engineers spend most of their actual time on the part the spec didn’t cover.

This isn’t the spec’s fault. It’s the cost of doing business in software. The mistake is letting it be invisible work. The edges are where most of the engineering judgment lives, and treating them like overhead instead of like the actual product is the reason engineering teams burn out and the reason customers churn.

What we do about it

On every engagement we ship two artefacts that nobody asks for and every team needs:

  1. A short doc that names the edges. What can fail. What the fallback is. What the user sees. Updated as we go.
  2. A small set of runbooks. Three to five pages, written for the on-call engineer who isn’t us. The kind you can read at 3 a.m. and act on.

Neither artefact appears on a roadmap. Both are why the engagement feels different a year after we leave.

If you’re hiring an engineering team

Watch for the edges in the conversation. If a team is selling you velocity and never talks about how the thing fails, they’re selling you the demo, not the system. Ask how they handle partial failure. Ask about a time their thing broke at 3 a.m. and what they did about it. The good answers come naturally; the bad ones come from a script.

A senior team is a team that knows where the edges are before production teaches them.

Working on something similar?

Tell us about the work.

A scoping call is free, takes thirty to sixty minutes, and ends with a yes or no on whether we’re the right team.