NearIRM Team · 5 min read

Feature Flags as Kill Switches: A Practical Guide for On-Call Engineers

When a deployment causes an incident, your first instinct is to roll back. But rollbacks go through the same CI/CD pipeline that got you into trouble. That means 10 to 30 minutes of waiting while your p99 latency climbs, your error budget burns, and your on-call Slack channel fills up with questions.

Feature flags, used specifically as kill switches, give you a faster path. They let you disable a broken feature in seconds without touching a pipeline.

This post covers how to set them up, how to name and organize them, and how to wire them into your incident response runbooks.

Kill Switches vs. Feature Flags

Most teams use feature flags for gradual rollouts and A/B tests. Kill switches are a specific type of flag with one job: turning things off in an emergency.

The distinction matters because the two have different properties:

|  | Gradual rollout flag | Kill switch |
|---|---|---|
| Who controls it | Product/engineering | On-call engineer |
| Change latency | Seconds | Seconds |
| Purpose | Vary behavior | Disable behavior |
| Lifecycle | Days to weeks | Permanent until removed |
| Audience targeting | % of users or cohorts | All users (on/off) |

Kill switches don't need percentage rollouts or user targeting. They're binary. That simplicity is the point.

What to Wrap in a Kill Switch

Not every feature needs one. You should add kill switches to:

  • New code paths in high-traffic flows (checkout, login, search)
  • Third-party integrations that could be flaky (payment processors, SMS providers, external enrichment APIs)
  • Background jobs that could overload your database or queue
  • Expensive or experimental ML inference calls
  • New rate limiting or auth behavior where a bug locks users out

A good heuristic: if a bug in this feature would trigger a Sev-1 or Sev-2, it should have a kill switch at launch.

Naming Conventions

Naming is where most teams make a mess. After six months you end up with flags like new_checkout_v2, enable_payment_retry, and legacy_auth_fallback, and no one knows which ones are kill switches and which are permanent config.

A simple prefix solves this:

kill.checkout.new-flow
kill.payments.stripe-v2
kill.search.ml-reranking
kill.jobs.email-digest

The kill. prefix means everyone (including an on-call engineer at 3am) immediately understands what toggling the flag does. Group by service or domain, then feature.

Keep a separate namespace for experiments:

exp.checkout.button-color-test
cfg.search.result-limit

This lets your runbooks say "turn off kill.* flags for the checkout service" without ambiguity.

Implementing a Kill Switch

The implementation is intentionally simple. You want fast reads, no external dependencies during evaluation, and a clear default.

Here's a minimal Python example using the Flagsmith client (the same shape works with LaunchDarkly or a homegrown Redis-backed client):

import os

from flagsmith import Flagsmith  # or the LaunchDarkly SDK, or your own Redis client

flagsmith = Flagsmith(environment_key=os.environ["FLAG_ENV_KEY"])

def is_kill_switch_active(flag_name: str) -> bool:
    # Default to False (kill switch inactive, feature ON) when the
    # flag service is unavailable
    try:
        flags = flagsmith.get_environment_flags()
        return flags.is_feature_enabled(flag_name)
    except Exception:
        return False  # fail open

def process_order(order):
    if is_kill_switch_active("kill.checkout.new-flow"):
        return legacy_process_order(order)  # safe fallback path
    return new_process_order(order)

Two things to get right here:

  1. Fail open by default. If your flag service goes down, return False so the feature stays on. The alternative (failing closed) means your flag service outage also disables your features. That's a bad day.

  2. Keep a fallback path. A kill switch is only useful if there's something to fall back to. If you can't disable a feature without breaking the whole flow, you don't have a kill switch, you have a time bomb.

In Node/TypeScript:

// flagClient: your feature-flag SDK client; the exact method name varies by provider
async function isKilled(flag: string): Promise<boolean> {
  try {
    return await flagClient.isEnabled(flag);
  } catch {
    return false; // fail open
  }
}

export async function handleSearch(query: string) {
  if (await isKilled("kill.search.ml-reranking")) {
    return lexicalSearch(query);
  }
  return mlSearch(query);
}

Integrating Kill Switches into Runbooks

The flag implementation is half the work. The other half is making sure on-call engineers know the flags exist and can use them under pressure.

Add a "Kill Switches" section to every runbook that covers a service with flags:

## Kill Switches

| Flag | Effect | Fallback |
|------|--------|----------|
| kill.checkout.new-flow | Reverts to v1 checkout | Legacy path, tested |
| kill.payments.stripe-v2 | Falls back to Stripe v1 | ~3% slower, stable |

To toggle: go to [flag dashboard link] or run:
  flagctl set kill.checkout.new-flow true

Don't make engineers search for the flag name during an incident. Put it in the runbook, next to the alert that would trigger its use.

Lifecycle and Cleanup

The biggest operational problem with kill switches isn't adding them, it's forgetting to remove them.

A flag left in place for months becomes load-bearing. Engineers forget the fallback path exists, stop testing it, and eventually it breaks. Then during an incident, you toggle the kill switch and fall back to code that hasn't been tested in a year.

A few practices that help:

  • Set a TTL in your tracking system. When you add a kill switch to a service, open a ticket to remove it 60-90 days after the feature stabilizes.
  • Review enabled kill switches in postmortems. If a kill switch got toggled during an incident, either remove the flag (if the feature is now stable) or extend its lifetime consciously.
  • Test fallback paths in game days. Periodically exercise kill switches in a staging or controlled production environment so you know the fallback still works. A minimal test sketch follows this list.
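
A game-day check can be as small as a test that forces the switch on and asserts the fallback still produces a valid result. Here's a minimal pytest sketch; the checkout module, make_test_order helper, and status field are hypothetical stand-ins for your own code:

# test_kill_switches.py -- exercised during game days, not just CI
import checkout  # hypothetical module containing process_order and is_kill_switch_active

def test_checkout_fallback_path(monkeypatch):
    # Force the kill switch on, regardless of what the flag service says
    monkeypatch.setattr(checkout, "is_kill_switch_active", lambda name: True)

    order = checkout.make_test_order()  # hypothetical fixture helper
    result = checkout.process_order(order)

    # The legacy path should return a completed order, not blow up
    assert result.status == "completed"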

Common Mistakes

Wrapping too little. The kill switch sits at the API handler level but the database migration already ran. Toggling the flag hides the UI but the new schema is still live. Flag the feature at every layer that changed.
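
To make that concrete, here's a sketch of the same flag guarding two layers; the handler and job names are hypothetical, building on the earlier checkout example:

def handle_checkout_request(order):
    # Layer 1: the user-facing request handler
    if is_kill_switch_active("kill.checkout.new-flow"):
        return legacy_process_order(order)
    return new_process_order(order)

def run_checkout_backfill_job():
    # Layer 2: the background job that writes to the new schema
    if is_kill_switch_active("kill.checkout.new-flow"):
        return  # skip the backfill entirely while the switch is on
    backfill_new_schema()  # hypothetical job body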

No fallback. The flag is there, but turning it on just returns a 500 because nobody implemented the old code path. Always test what happens when the flag is enabled before you need it in production.

Sharing flags across services. kill.new-infra that affects checkout, payments, and auth is a single point of failure for your mitigation. Scope flags to one service and one feature.

Evaluating flags on every request without caching. A synchronous HTTP call to a flag service on every request adds latency and creates a new failure dependency. Use a local cache with a short TTL (5-30 seconds).
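
One way to do this is a small in-process wrapper that memoizes each flag for a few seconds. A minimal sketch, assuming a client object with an is_feature_enabled(name) method like the Flagsmith example above:

import time

class CachedFlagReader:
    # Caches flag reads in-process so a network call doesn't happen per request
    def __init__(self, client, ttl_seconds: float = 10.0):
        self._client = client   # assumed to expose is_feature_enabled(name)
        self._ttl = ttl_seconds
        self._cache = {}        # flag name -> (value, fetched_at)

    def is_enabled(self, flag_name: str) -> bool:
        now = time.monotonic()
        cached = self._cache.get(flag_name)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]
        try:
            value = self._client.is_feature_enabled(flag_name)
        except Exception:
            # Prefer the last known value; otherwise fail open
            value = cached[0] if cached is not None else False
        self._cache[flag_name] = (value, now)
        return value

On an error this serves the last cached value when one exists, so a kill switch you toggled on stays on through a brief flag-service blip instead of silently reverting.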

Where to Start

If you don't have kill switches today, pick one: your highest-traffic feature that's had an incident in the last six months. Add a flag, implement the fallback path, test it in staging, and add it to the runbook. That's it.

Once on-call engineers see how much faster mitigation gets, the pattern spreads on its own.
