From "webhooks stopped" to root cause in 2 minutes.

Webhook failure tickets are both urgent and complex. The customer is not receiving events, and the root cause could be anywhere: their endpoint, your delivery pipeline, an upstream provider outage, or a billing issue. Altor investigates all of them simultaneously.

Why webhook tickets take so long manually

A customer reports: "Our webhook endpoint stopped receiving events." The support engineer's investigation path:

  1. Check delivery logs in ClickHouse. Look at success rate over the last 4–6 hours. Find it dropped from 98% to 12%.
  2. Check what errors the endpoint is returning. Find 503 (Service Unavailable) responses: the customer's server is not able to handle requests.
  3. Check Stripe to rule out billing. Subscription active, webhook quota not exceeded.
  4. Check StatusPage for upstream outages. Find AWS us-east-1 is degraded — matches the customer's region.
  5. Synthesize: customer endpoint is down due to AWS outage. Events are queued for retry. No data loss.
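
Step 1 above amounts to comparing delivery success rates across two time windows. The following is a minimal Python sketch of that comparison, not Altor's implementation: the `Delivery` record, the `success_rate` and `rate_dropped` helpers, and the 0.5 drop threshold are all hypothetical names and values chosen for illustration, and the records are assumed to have already been fetched from the delivery logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Delivery:
    """One webhook delivery attempt (hypothetical shape, not a real schema)."""
    sent_at: datetime
    status_code: int


def success_rate(deliveries: List[Delivery], start: datetime, end: datetime) -> Optional[float]:
    """Fraction of deliveries in [start, end) that got a 2xx response.

    Returns None when the window contains no deliveries.
    """
    window = [d for d in deliveries if start <= d.sent_at < end]
    if not window:
        return None
    ok = sum(1 for d in window if 200 <= d.status_code < 300)
    return ok / len(window)


def rate_dropped(deliveries: List[Delivery], now: datetime,
                 window: timedelta = timedelta(hours=4),
                 threshold: float = 0.5) -> bool:
    """True if the latest window's success rate fell by more than `threshold`
    compared with the window immediately before it."""
    recent = success_rate(deliveries, now - window, now)
    before = success_rate(deliveries, now - 2 * window, now - window)
    if recent is None or before is None:
        return False  # not enough data to compare
    return before - recent > threshold
```

With 49 of 50 deliveries succeeding in the earlier window and 3 of 25 in the latest one, `success_rate` yields 0.98 and 0.12, and `rate_dropped` reports a drop.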

25–40 min: typical manual investigation for a webhook failure ticket

4+: systems checked (delivery logs, endpoint status, billing, upstream incidents)

2 min: Altor's investigation time for the same ticket

How Altor investigates webhook failures

Altor runs the same checks, but in parallel, in 2 minutes:

  1. Queries ClickHouse: webhook delivery success rate dropped from 98% to 12% over the last 4 hours. Endpoint returning 503.
  2. Checks Stripe: subscription active, webhook quota not exceeded. Not a billing issue.
  3. Checks StatusPage: AWS us-east-1 degraded — matches customer's region.
  4. Delivers diagnosis: customer endpoint is down due to regional AWS degradation. Events are queued and will auto-retry. No data loss.
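
To illustrate what running the checks simultaneously can look like, here is a minimal Python sketch using `asyncio.gather`. It is an assumption-laden illustration, not Altor's code: `check_delivery_logs`, `check_billing`, `check_upstream_status`, and `investigate` are hypothetical functions, and the stubs return canned values that mirror the example ticket above instead of calling ClickHouse, Stripe, or StatusPage.

```python
import asyncio


async def check_delivery_logs(customer_id: str) -> dict:
    # Hypothetical stub: a real check would query the delivery logs.
    await asyncio.sleep(0.01)
    return {"rate_before": 0.98, "rate_now": 0.12, "top_error": 503}


async def check_billing(customer_id: str) -> dict:
    # Hypothetical stub: a real check would call the billing provider.
    await asyncio.sleep(0.01)
    return {"subscription_active": True, "quota_exceeded": False}


async def check_upstream_status(region: str) -> dict:
    # Hypothetical stub: a real check would read the provider's status page.
    await asyncio.sleep(0.01)
    return {"region": region, "degraded": True}


async def investigate(customer_id: str, region: str) -> str:
    # Run all three checks concurrently; results come back in call order.
    logs, billing, upstream = await asyncio.gather(
        check_delivery_logs(customer_id),
        check_billing(customer_id),
        check_upstream_status(region),
    )
    if not billing["subscription_active"] or billing["quota_exceeded"]:
        return "billing issue"
    if logs["top_error"] in (502, 503):
        if upstream["degraded"]:
            return "endpoint down: upstream outage in customer's region"
        return "endpoint down: customer's server"
    return "undetermined"


if __name__ == "__main__":
    print(asyncio.run(investigate("cust_123", "us-east-1")))
```

Because the checks do not depend on one another, total latency is roughly that of the slowest check instead of the sum of all three.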

"Webhook failures used to be our scariest tickets — the customer thinks they're losing data. Now we have the full picture in 2 minutes: what's failing, why, and whether events are safe."

— Engineering lead, Portkey

Webhook failure patterns Altor handles

Webhook failures can have very different root causes. Altor investigates across the common patterns:

  • Endpoint down (503/502) — identifies whether it's the customer's server or an upstream outage
  • Timeout failures — checks if payload size increased or endpoint response time degraded
  • Authentication rejected (401/403) — verifies webhook signing secret rotation and credential status
  • Rate limiting (429) — checks if delivery volume exceeded the customer's endpoint capacity
  • SSL/TLS errors — identifies certificate expiration or misconfiguration
  • Partial failures — compares delivery rates across event types to isolate the affected subset
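
The patterns above can be sketched as a simple triage step in Python. This is a hedged illustration only: `classify_failure`, `failure_rate_by_event_type`, the pattern labels, and the `"timeout"` / `"tls"` error strings are hypothetical, and a real investigation would go on to check the cause behind each pattern.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def classify_failure(status_code: Optional[int], error: Optional[str] = None) -> str:
    """Map one failed delivery to a coarse failure pattern.

    `error` carries transport-level failures that never produced an HTTP
    status (hypothetical values: "timeout", "tls").
    """
    if error == "timeout":
        return "timeout"
    if error == "tls":
        return "tls_error"
    if status_code in (502, 503):
        return "endpoint_down"
    if status_code in (401, 403):
        return "auth_rejected"
    if status_code == 429:
        return "rate_limited"
    return "unknown"


def failure_rate_by_event_type(deliveries: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Failure rate per event type, given (event_type, status_code) pairs.

    A rate that is high for only some event types points to a partial failure.
    """
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for event_type, status_code in deliveries:
        totals[event_type] += 1
        if not 200 <= status_code < 300:
            failures[event_type] += 1
    return {t: failures[t] / totals[t] for t in totals}
```

For example, `classify_failure(429)` returns `"rate_limited"`, and a log where only one event type shows failures would surface as a single non-zero entry in the per-type rates.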

See Altor investigate a real webhook failure

We'll connect to your delivery logs, billing, and monitoring systems and diagnose a webhook issue from your queue — live.

Get a 3-minute walkthrough — no call needed.