From "webhooks stopped" to root cause in 2 minutes.
Webhook failure tickets are high-urgency and high-complexity. The customer is losing events, and the root cause could be anywhere: their endpoint, your delivery pipeline, an upstream provider outage, or a billing issue. Altor investigates all of them simultaneously.
Why webhook tickets take so long manually
A customer reports: "Our webhook endpoint stopped receiving events." The support engineer's investigation path:
- Check delivery logs in ClickHouse. Look at success rate over the last 4-6 hours. Find it dropped from 98% to 12%.
- Check what errors the endpoint is returning. Find 503 responses - the customer's server is unavailable.
- Check Stripe to rule out billing. Subscription active, webhook quota not exceeded.
- Check StatusPage for upstream outages. Find AWS us-east-1 is degraded - matches the customer's region.
- Synthesize: the customer's endpoint is down because of the AWS us-east-1 degradation. Events are queued for retry. No data loss.
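The first two steps above come down to comparing the delivery success rate across two time windows and finding the most common error. Here is a minimal sketch of that arithmetic in Python, assuming delivery attempts have already been pulled out of the log store. The `Delivery` record and both function names are hypothetical, for illustration only, and are not Altor's or ClickHouse's API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Delivery:
    """One webhook delivery attempt (hypothetical log row)."""
    attempted_at: datetime
    status_code: int


def success_rate(deliveries, start, end):
    """Fraction of attempts in [start, end) that returned a 2xx status."""
    window = [d for d in deliveries if start <= d.attempted_at < end]
    if not window:
        return None  # no traffic in the window, so there is no rate to report
    ok = sum(1 for d in window if 200 <= d.status_code < 300)
    return ok / len(window)


def dominant_error(deliveries, start, end):
    """Most common non-2xx status code in [start, end), or None if none failed."""
    counts = {}
    for d in deliveries:
        if start <= d.attempted_at < end and not 200 <= d.status_code < 300:
            counts[d.status_code] = counts.get(d.status_code, 0) + 1
    return max(counts, key=counts.get) if counts else None
```

Comparing `success_rate` for the last 4 hours against the 4 hours before it shows a drop like the one in this example, and `dominant_error` reports which status code accounts for most of the failures.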
typical manual investigation for a webhook failure ticket
systems checked: delivery logs, endpoint status, billing, upstream incidents
Altor's investigation time for the same ticket
How Altor investigates webhook failures
Altor runs all the same checks - but simultaneously, in under 2 minutes:
- Queries ClickHouse: webhook delivery success rate dropped from 98% to 12% over the last 4 hours. Endpoint returning 503.
- Checks Stripe: subscription active, webhook quota not exceeded. Not a billing issue.
- Checks StatusPage: AWS us-east-1 degraded - matches customer's region.
- Delivers diagnosis: customer endpoint is down due to regional AWS degradation. Events are queued and will auto-retry. No data loss.
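To show what "simultaneously" means in practice, here is a minimal Python sketch that runs the three checks concurrently with `asyncio.gather` and then combines the results. It is not Altor's implementation. The three `check_*` functions are hypothetical stubs that return the values from the example above instead of calling ClickHouse, Stripe, or StatusPage, and the `synthesize` rules are assumed for illustration.

```python
import asyncio


async def check_delivery_logs(customer_id):
    # Hypothetical stand-in for a delivery-log query.
    await asyncio.sleep(0.01)
    return {"success_rate": 0.12, "dominant_status": 503}


async def check_billing(customer_id):
    # Hypothetical stand-in for a billing lookup.
    await asyncio.sleep(0.01)
    return {"subscription_active": True, "quota_exceeded": False}


async def check_upstream_status(customer_id):
    # Hypothetical stand-in for a status-page lookup.
    await asyncio.sleep(0.01)
    return {"degraded_regions": ["us-east-1"], "customer_region": "us-east-1"}


def synthesize(logs, billing, upstream):
    """Combine the three findings into a one-line diagnosis (assumed rules)."""
    if not billing["subscription_active"] or billing["quota_exceeded"]:
        return "billing issue"
    region_degraded = upstream["customer_region"] in upstream["degraded_regions"]
    if region_degraded and logs["dominant_status"] in (502, 503):
        return "customer endpoint down during regional upstream degradation"
    if logs["success_rate"] < 0.5:
        return "customer endpoint failing; no upstream incident found"
    return "no delivery problem detected"


async def investigate(customer_id):
    # gather() starts all three checks at once and waits for all of them,
    # so total latency is roughly that of the slowest check, not the sum.
    logs, billing, upstream = await asyncio.gather(
        check_delivery_logs(customer_id),
        check_billing(customer_id),
        check_upstream_status(customer_id),
    )
    return synthesize(logs, billing, upstream)
```

With the stub data above, `asyncio.run(investigate("cust_123"))` returns "customer endpoint down during regional upstream degradation", which matches the diagnosis in this example.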
"Webhook failures used to be our scariest tickets - the customer thinks they're losing data. Now we have the full picture in 2 minutes: what's failing, why, and whether events are safe."
Webhook failure patterns Altor handles
Webhook failures rarely share a single root cause. Altor investigates across the common patterns:
- Endpoint down (503/502) - identifies whether it's the customer's server or an upstream outage
- Timeout failures - checks if payload size increased or endpoint response time degraded
- Authentication rejected (401/403) - verifies webhook signing secret rotation and credential status
- Rate limiting (429) - checks if delivery volume exceeded the customer's endpoint capacity
- SSL/TLS errors - identifies certificate expiration or misconfiguration
- Partial failures - compares delivery rates across event types to isolate the affected subset
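The patterns above can be told apart from the status code and error text of each failed delivery. The Python sketch below shows one way to bucket them. The category names, the free-text `error` field, and the keyword matching are assumptions made for illustration and do not describe how Altor classifies failures.

```python
from typing import Optional


def classify_failure(status_code: Optional[int], error: str = "") -> str:
    """Map one delivery attempt to a failure pattern.

    status_code is None when no HTTP response arrived (for example a timeout
    or a failed TLS handshake); `error` is then a free-text message from the
    delivery client (hypothetical field).
    """
    if status_code is None:
        text = error.lower()
        if "ssl" in text or "tls" in text or "certificate" in text:
            return "ssl_tls_error"
        if "timeout" in text or "timed out" in text:
            return "timeout"
        return "unknown"
    if status_code in (502, 503):
        return "endpoint_down"
    if status_code in (401, 403):
        return "authentication_rejected"
    if status_code == 429:
        return "rate_limited"
    if 200 <= status_code < 300:
        return "delivered"
    return "unknown"


def failure_rate_by_event_type(attempts):
    """Failure rate per event type, to spot partial failures.

    attempts: iterable of (event_type, status_code_or_None) pairs.
    """
    totals, failures = {}, {}
    for event_type, status in attempts:
        totals[event_type] = totals.get(event_type, 0) + 1
        if status is None or not 200 <= status < 300:
            failures[event_type] = failures.get(event_type, 0) + 1
    return {t: failures.get(t, 0) / totals[t] for t in totals}
```

A large gap between event types in the output of `failure_rate_by_event_type` points to a partial failure, where only a subset of events is affected.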
See Altor investigate a real webhook failure
We'll connect to your delivery logs, billing, and monitoring systems and diagnose a webhook issue from your queue - live.