NearIRM Team
6 min read

When the Outage Isn't Yours: Handling Third-Party Vendor Incidents

Your alerting fires at 2 AM. Payments are failing. You pull up logs, check your services, and find nothing obviously wrong on your end. Then you check Stripe's status page: "Investigating elevated error rates on payment processing." The outage is theirs, but the incident is still yours to manage.

Third-party vendor incidents are one of the most frustrating situations in incident response. You can't fix the underlying problem. Your customers don't care whose fault it is. And the timeline is entirely out of your control.

Here's how to handle them without making things worse.

First: confirm it's actually the vendor

Before you declare a vendor incident, rule out your own infrastructure. A misconfigured deployment or a rate limit you hit yourself can look identical to a vendor problem from the outside.

Check these things fast:

  • Did anything deploy in the last hour?
  • Are error rates elevated across all vendors or just one?
  • Are your internal services healthy (CPU, memory, queue depth)?
  • Is the failure affecting all customers or a subset?
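
The "all vendors or just one" check can be sketched as a quick tally over recent outbound call records. This is a minimal illustration rather than a monitoring setup: the `CallRecord` shape, the vendor labels, and the 5% threshold are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    vendor: str  # hypothetical label for the outbound dependency
    ok: bool     # True if the call succeeded


def error_rates(records):
    """Return {vendor: fraction of failed calls} for the given records."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        totals[record.vendor] += 1
        if not record.ok:
            failures[record.vendor] += 1
    return {vendor: failures[vendor] / totals[vendor] for vendor in totals}


def suspect_vendors(records, threshold=0.05):
    """Vendors whose error rate exceeds the (assumed) threshold, sorted by name."""
    rates = error_rates(records)
    return sorted(vendor for vendor, rate in rates.items() if rate > threshold)
```

If only one vendor comes back, the problem is probably external. If every vendor does, look at your own network or a recent deploy first.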

Once you're confident the problem is external, check the vendor's status page. Most have one: AWS has the AWS Health Dashboard, Stripe has status.stripe.com, and so on. Bookmark these. You'll need them at the worst possible times.

Search Twitter/X and engineering community Slack channels too. Other companies seeing the same thing will post about it, often faster than any official status update.

Communicate internally before you have answers

The instinct in vendor incidents is to wait until you know what's happening before telling anyone. Don't. Your on-call responders, engineering leads, and customer-facing teams need to know immediately that something is wrong, even if the root cause is still unclear.

A brief internal incident message early is better than silence followed by a wall of updates. Something like:

Incident open: Payment failures elevated since ~02:14 UTC. Initial investigation points to Stripe. We're monitoring their status page and assessing mitigation options. No customer comms sent yet.

That's enough. It sets expectations, tells people where to look, and gets everyone on the same page. You can update it as you learn more.

What to actually do while you wait

You can't fix the vendor's infrastructure, but you're not helpless. Work through these in parallel:

Check if you have mitigation options. Does your architecture support graceful degradation? If Stripe is down, can you queue payment attempts for retry rather than failing immediately? Can you show customers a clear error instead of a timeout? Even partial fallbacks reduce blast radius.
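
One way the queue-for-retry idea could look, as a sketch under stated assumptions: `charge` is a hypothetical callable standing in for your payment provider client, `VendorUnavailable` is a hypothetical exception it raises when the provider is unreachable, and the in-memory deque stands in for a durable queue.

```python
from collections import deque


class VendorUnavailable(Exception):
    """Hypothetical error raised when the payment provider cannot be reached."""


class PaymentQueue:
    """Defers payments while the vendor is down (in-memory stand-in for a durable queue)."""

    def __init__(self, charge):
        self._charge = charge  # hypothetical callable: charge(payment) -> None
        self._pending = deque()

    def submit(self, payment):
        """Try to charge now; on vendor failure, queue the payment for later."""
        try:
            self._charge(payment)
            return "charged"
        except VendorUnavailable:
            self._pending.append(payment)
            return "queued"

    def drain(self):
        """Retry queued payments in order; stop at the first failure."""
        charged = 0
        while self._pending:
            try:
                self._charge(self._pending[0])
            except VendorUnavailable:
                break
            self._pending.popleft()
            charged += 1
        return charged

    def __len__(self):
        return len(self._pending)
```

In a real system the queue would need to survive restarts, and each queued payment should carry an idempotency key so that a retry cannot charge a customer twice.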

Review your circuit breaker and retry settings. If you're hammering a degraded vendor with retries, you're making their recovery harder and yours slower. Make sure your retry logic uses exponential backoff, ideally with jitter, and that a circuit breaker stops calls altogether once failures pile up, so you're not generating unnecessary load on an already struggling service.
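
For reference, a minimal sketch of exponential backoff and a simple circuit breaker. The default values (base delay, cap, failure count, cooldown) are illustrative assumptions, not recommendations, and most HTTP or resilience libraries ship their own versions of both.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


class CircuitBreaker:
    """Stops calling a failing vendor after repeated errors, then probes again after a cooldown."""

    def __init__(self, max_failures=5, cooldown=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self):
        """True if a call may be attempted right now."""
        if self._opened_at is None:
            return True
        # After the cooldown, let calls through again as a probe (half-open).
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self):
        """A successful call closes the breaker and resets the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        """Count a failure; open (or re-open) the breaker once the limit is reached."""
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()
```

A caller would check `allow()` before each vendor request, sleep for `backoff_delay(attempt)` between retries, and report the outcome with `record_success()` or `record_failure()`. A probe that fails after the cooldown re-opens the breaker for another full cooldown.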

Look at your vendor contracts. Enterprise plans often include dedicated support lines or technical account managers. If you're on one, use it. Don't just watch the status page. Get a human on the phone who can tell you the actual timeline.

Prepare customer communication, but don't send it yet. Draft your status page update, your support response, and your email template now. Drafting only once you've decided to go public wastes time when it matters most. Have the message ready to go the moment you decide to send it.

Communicating with affected customers

When to go public with customer communication is a judgment call. A few factors:

  • How long has the impact been visible? If customers are already seeing errors and reaching out to support, you need to post something now.
  • How long is this likely to last? Five minutes of payment failures is different from two hours.
  • What can customers actually do? If the answer is "nothing, just wait," tell them that clearly.

Be direct on your status page. "We are experiencing payment processing issues due to a problem with our payment provider. Our engineering team is monitoring the situation and will update this page as we have more information." That's honest, accurate, and doesn't overpromise a fix time you don't control.

What customers don't want: "We are aware of an issue affecting some users." That says nothing. Tell them what's broken and what you know so far.

Escalating with the vendor

Most vendor incidents you'll wait out by watching a status page. For serious, prolonged outages affecting a large portion of your customers, push harder.

Open a support ticket immediately, even if you expect an automated response. Ticket timestamps matter. If the vendor's SLA includes a response window, the clock typically starts when you submit.

If you have a technical account manager, call or message them directly. Explain your specific impact (transaction volume lost, users affected, revenue hit). Vendors prioritize incidents differently based on who's affected and how badly.

For AWS, the account-specific view of the AWS Health Dashboard (formerly the Personal Health Dashboard) shows incidents affecting your specific account, which is more relevant than the general status page. If you're on a Business or Enterprise support plan, you can also open a support case and request a call.

Document every interaction with the vendor: timestamps, what was said, who you spoke to. You'll need this for the postmortem and potentially for credits or SLA claims.
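
A lightweight way to keep that record consistent is a small structured log entry. This is a sketch only: the `VendorContact` name and its fields are assumptions for illustration, and a shared incident document works just as well.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class VendorContact:
    vendor: str
    channel: str  # e.g. "support ticket", "phone", "TAM chat"
    contact: str  # who you spoke to
    summary: str  # what was said
    # Defaults to the current time in UTC; pass a timezone-aware datetime to override.
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def as_line(self):
        """One timestamped line suitable for pasting into the incident timeline."""
        stamp = self.at.astimezone(timezone.utc)
        return (
            f"{stamp:%Y-%m-%d %H:%M} UTC | {self.vendor} | {self.channel} | "
            f"{self.contact} | {self.summary}"
        )
```

Recording the timestamp at the moment of contact, rather than reconstructing it afterward, is what makes the log usable for an SLA claim.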

The postmortem still matters

Even when the root cause was entirely outside your control, run a postmortem. The questions to focus on:

| Question | Why it matters |
| --- | --- |
| How long did it take us to identify the vendor as the cause? | Faster diagnosis means faster customer communication |
| Did our monitoring surface the problem, or did customers report it first? | Gaps in synthetic monitoring or external checks |
| Did our mitigation options (circuit breakers, fallbacks) work as expected? | Identifies architectural weaknesses |
| What did our customer communication look like? | Tone, timing, accuracy |
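
The timing questions can be answered straight from the incident timeline. A small sketch, assuming you recorded four timestamps during the incident; the function and key names here are illustrative, not a standard.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed minutes from start to end."""
    return (end - start).total_seconds() / 60


def detection_metrics(
    impact_start: datetime,
    alert_fired: datetime,
    vendor_identified: datetime,
    customers_notified: datetime,
) -> dict:
    """Minutes elapsed at each stage, measured from the start of impact."""
    return {
        "time_to_detect": minutes_between(impact_start, alert_fired),
        "time_to_identify_vendor": minutes_between(impact_start, vendor_identified),
        "time_to_notify_customers": minutes_between(impact_start, customers_notified),
    }
```

Tracking the same three numbers across vendor incidents shows whether detection and communication are actually getting faster.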

You won't fix the vendor's infrastructure, but you can fix your detection time, your fallback behavior, and your communication process. Those are worth improving.

Longer-term: reduce your blast radius

Single points of failure in your payment stack, email provider, or cloud region are risks you can reduce. Not eliminate, but reduce.

Some options to evaluate:

  • Redundant providers for critical paths. Dual-provider payment processing is expensive to build, but it keeps payments flowing in exactly this scenario.
  • Queuing over synchronous calls. If a task doesn't need to complete in real-time, put it in a queue. The vendor comes back, the queue drains.
  • Vendor SLAs and credits. Know what your contract actually guarantees. Some vendors offer credits for extended outages. Most don't apply them automatically.
  • Synthetic monitoring on vendor-dependent flows. If your checkout flow breaks every time Stripe coughs, you want your monitoring to catch that, not your customers.
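
A synthetic check on a vendor-dependent flow can be as small as the sketch below. It assumes `flow` is a hypothetical callable that exercises the flow end to end (for example, a checkout against the vendor's test mode) and raises on failure. The 5-second latency budget is also an assumption.

```python
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool
    latency_s: float
    error: str = ""


def run_probe(flow, max_latency_s=5.0, clock=time.monotonic):
    """Run one synthetic pass of a vendor-dependent flow and classify the outcome."""
    start = clock()
    try:
        flow()
    except Exception as exc:  # any failure in the flow counts as a failed probe
        return ProbeResult(False, clock() - start, repr(exc))
    latency = clock() - start
    if latency > max_latency_s:
        return ProbeResult(False, latency, "too slow")
    return ProbeResult(True, latency)
```

Run it on a schedule and alert on a few consecutive failures rather than a single one, so a brief vendor blip doesn't page anyone.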

Vendor incidents will happen. AWS will have a us-east-1 event. Your payment processor will have a bad afternoon. The question is how fast you catch it, how well you communicate, and how much of your product still works when it does.
