API Incident Diagnosis

How to Investigate 429 Errors in Your API

Published Mar 23, 2026 · 12 min read

A 429 response is simple on paper: too many requests. In practice, investigating 429 incidents in a B2B environment is rarely simple. Customers report "your API is failing," but the real cause can be bursty client behavior, a mismatched rate-limit policy, a billing entitlement issue, or a backend regression introduced by a deploy.

According to a 2025 Zendesk benchmark study of US SaaS companies, technical queues are under the most pressure to improve first-contact resolution, and the average US support team now handles 400+ tickets per week. At that volume, every slow investigation compounds into real SLA and renewal risk.

The fastest support teams do not jump to assumptions. They run a structured investigation that confirms symptom, isolates scope, and correlates with product and account state. This guide gives a practical sequence you can use immediately.

What causes 429 errors

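A 429 can originate from client behavior (bursty traffic), from policy (a rate limit that no longer fits the customer, or a limiter key shared across tenants), from billing (a plan or entitlement change that lowered the effective limit), or from the platform itself (a regression introduced by a deploy). Because all of these produce the same response code, diagnosis requires evidence beyond a single log sample.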

Step 1: quantify the symptom in logs

Start by measuring exactly when and where 429s increased. You need time-bucketed rates by customer and endpoint, not just raw totals.

SELECT
  toStartOfFiveMinute(ts) AS bucket,
  customer_id,
  endpoint,
  countIf(status_code = 429) AS throttled,
  count() AS total,
  round(throttled / total, 4) AS throttle_rate
FROM api_requests
WHERE ts >= now() - INTERVAL 6 HOUR
  AND customer_id = 'cust_abc123'
GROUP BY bucket, customer_id, endpoint
ORDER BY bucket ASC, throttle_rate DESC;

This immediately answers critical questions: did the increase begin at a specific timestamp, is one endpoint dominant, and is the issue sustained or bursty.
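To turn those rows into a concrete onset timestamp, a small helper can compare each bucket against a baseline taken from the start of the window. This is a minimal sketch, not a definitive detector: the `Bucket` type, `find_onset`, and all three thresholds are hypothetical, and it assumes the rows cover a single endpoint and that the first `baseline_n` buckets predate the incident.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Bucket(NamedTuple):
    start: datetime  # bucket start, as returned by the query
    throttled: int   # requests answered with 429
    total: int       # all requests in the bucket


def find_onset(buckets: list[Bucket], baseline_n: int = 6,
               factor: float = 3.0, min_rate: float = 0.05) -> Optional[datetime]:
    """Return the start of the first bucket whose throttle rate is at least
    min_rate and more than factor times the baseline rate, else None."""
    ordered = sorted(buckets, key=lambda b: b.start)
    head = ordered[:baseline_n]
    base_total = sum(b.total for b in head)
    # Baseline rate pooled over the first baseline_n buckets (assumed healthy).
    baseline = sum(b.throttled for b in head) / base_total if base_total else 0.0
    for b in ordered[baseline_n:]:
        if b.total == 0:
            continue  # no traffic in this bucket, nothing to measure
        rate = b.throttled / b.total
        if rate >= min_rate and rate > factor * baseline:
            return b.start
    return None
```

If the spike already started inside the baseline window, this helper will miss it, so widen the query interval before trusting a `None` result.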

Step 2: determine blast radius

Next, identify whether the event is tenant-specific or platform-wide. If multiple unrelated customers show synchronized 429 spikes, suspect shared infrastructure or policy changes.

SELECT
  toStartOfFiveMinute(ts) AS bucket,
  uniqExactIf(customer_id, status_code = 429) AS affected_customers,  -- count only customers that actually received 429s
  countIf(status_code = 429) AS throttled_requests
FROM api_requests
WHERE ts >= now() - INTERVAL 6 HOUR
GROUP BY bucket
ORDER BY bucket ASC;

A single-customer spike points toward account-level behavior. Multi-customer spikes elevate urgency and often justify incident protocol.
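That triage decision can be made explicit so it is applied the same way on every ticket. The sketch below assumes you have already aggregated 429 counts per customer for the incident window; the function name, the labels, and the `min_throttled` noise floor are illustrative assumptions, not values from any particular system.

```python
def classify_blast_radius(throttled_by_customer: dict[str, int],
                          min_throttled: int = 50) -> str:
    """Label an incident window by how many customers were meaningfully throttled.

    throttled_by_customer maps customer_id -> number of 429 responses.
    min_throttled is a hypothetical floor that ignores background noise.
    """
    affected = [c for c, n in throttled_by_customer.items() if n >= min_throttled]
    if not affected:
        return "no-spike"
    if len(affected) == 1:
        return "single-tenant"
    return "multi-tenant"
```

A `"multi-tenant"` result is the signal to consider incident protocol; it says nothing yet about whether the affected customers are actually unrelated, which still needs a human check.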

Step 3: inspect rate-limit dimensions

Confirm which key the limiter used (API key, org ID, IP, endpoint, region) and whether the traffic distribution across those keys changed. A common failure mode is an accidental key collision, where multiple customers share a limiter bucket.

SELECT
  limiter_key,
  countIf(status_code = 429) AS throttled,
  groupUniqArray(rate_limit_policy) AS policies,  -- every policy seen for the key, so a mid-window change is visible
  min(ts) AS first_seen,
  max(ts) AS last_seen
FROM api_requests
WHERE ts >= now() - INTERVAL 2 HOUR
  AND customer_id = 'cust_abc123'
GROUP BY limiter_key
ORDER BY throttled DESC;
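Because that query is filtered to one customer, it cannot show whether another tenant is drawing from the same bucket. A rough way to check for a collision is to pull `(limiter_key, customer_id)` pairs without the customer filter and group them. The helper below is a sketch under that assumption; `find_shared_limiter_keys` is a hypothetical name.

```python
from collections import defaultdict
from typing import Iterable


def find_shared_limiter_keys(rows: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Return limiter keys that were used by more than one customer.

    rows is an iterable of (limiter_key, customer_id) pairs taken from
    request logs across all customers for the incident window.
    """
    customers_by_key: dict[str, set[str]] = defaultdict(set)
    for limiter_key, customer_id in rows:
        customers_by_key[limiter_key].add(customer_id)
    # Keep only keys where two or more distinct customers appear.
    return {key: sorted(customers)
            for key, customers in customers_by_key.items()
            if len(customers) > 1}
```

Keys built on IP or region can be shared legitimately, so treat a hit as a lead to verify against the intended limiter design rather than as proof of a bug.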

Step 4: validate billing and entitlements

After confirming the technical pattern, check the customer's Stripe subscription and entitlement state. Plan transitions, failed renewals, and quota sync lag can all change the effective limit without any change in traffic. Support teams often miss this because they assume a 429 is purely backend behavior.
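One way to make this check repeatable is to put the billing view and the limiter view side by side and flag any disagreement. The sketch below deliberately avoids calling any real API: `EntitlementSnapshot` and its fields are hypothetical, and it assumes you have already fetched the subscription status from billing and the enforced limit from your limiter configuration. Treating `"active"` and `"trialing"` as the healthy statuses is also an assumption to adjust for your billing setup.

```python
from dataclasses import dataclass


@dataclass
class EntitlementSnapshot:
    plan: str
    subscription_status: str  # status string from billing, e.g. "active" or "past_due"
    expected_rpm: int         # requests/minute the plan should grant
    enforced_rpm: int         # requests/minute the limiter is actually applying


def entitlement_findings(s: EntitlementSnapshot) -> list[str]:
    """Return human-readable mismatches between billing and limiter state."""
    findings = []
    if s.subscription_status not in ("active", "trialing"):
        findings.append(f"subscription status is {s.subscription_status!r}")
    if s.enforced_rpm < s.expected_rpm:
        findings.append(
            f"enforced limit {s.enforced_rpm} rpm is below the {s.plan} "
            f"plan limit of {s.expected_rpm} rpm")
    elif s.enforced_rpm > s.expected_rpm:
        findings.append(
            f"enforced limit {s.enforced_rpm} rpm is above the {s.plan} "
            f"plan limit of {s.expected_rpm} rpm")
    return findings
```

An empty list means billing and the limiter agree, which lets you rule this branch out in the diagnosis rather than leave it unstated.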

Step 5: check known bugs and recent deploys

Search Linear for open issues tagged rate-limit, throttling, or quota. Then correlate GitHub deploy history with symptom onset. If 429 rates jump within minutes of a deploy, a regression becomes significantly more likely.

Useful habit: include commit hash, rollout window, and affected service in the escalation packet. It helps engineering validate causality quickly.
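The deploy correlation itself is simple enough to script once you have an onset timestamp. The following sketch assumes deploy records have already been exported from your deploy history; the `Deploy` type, the field names, and the 30-minute window are hypothetical choices, and proximity in time is only a lead, not proof of causality.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Deploy(NamedTuple):
    commit: str            # commit hash for the escalation packet
    service: str           # affected service
    finished_at: datetime  # end of the rollout window


def deploys_before_onset(deploys: list[Deploy], onset: datetime,
                         window: timedelta = timedelta(minutes=30)) -> list[Deploy]:
    """Return deploys that finished within `window` before onset, closest first."""
    candidates = [d for d in deploys
                  if timedelta(0) <= onset - d.finished_at <= window]
    # Smallest gap between deploy and onset comes first.
    return sorted(candidates, key=lambda d: onset - d.finished_at)
```

Deploys that finished after the onset are excluded on purpose, since they cannot have caused a spike that was already under way.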

Step 6: produce a diagnosis, not just logs

The final output should answer four things clearly: what happened, who is affected, most likely cause, and what action is being taken. Customers need confidence and timelines; engineering needs precision and evidence links.

Example summary: "429 rate increased from 1.8% to 22.4% for customer cust_abc123 on /v1/chat between 10:15-10:40 UTC. No similar spike across other tenants. Stripe shows plan unchanged; entitlement sync healthy. Recent deploy 7b2f3c changed endpoint-level limiter normalization. Likely regression. Escalated to API platform owner with query outputs attached."
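If you want every diagnosis to carry those four answers, a small structure can enforce it before the summary goes out. This is an illustrative sketch only; the `Diagnosis` class and its field names are assumptions, not a schema from any ticketing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Diagnosis:
    what_happened: str
    who_is_affected: str
    likely_cause: str
    action_taken: str
    evidence: list[str] = field(default_factory=list)  # query outputs, links

    def missing_fields(self) -> list[str]:
        """Names of required answers that are still blank."""
        required = ("what_happened", "who_is_affected",
                    "likely_cause", "action_taken")
        return [name for name in required if not getattr(self, name).strip()]

    def render(self) -> str:
        """Plain-text summary suitable for a ticket or escalation."""
        lines = [
            f"What happened: {self.what_happened}",
            f"Who is affected: {self.who_is_affected}",
            f"Most likely cause: {self.likely_cause}",
            f"Action taken: {self.action_taken}",
        ]
        lines += [f"Evidence: {item}" for item in self.evidence]
        return "\n".join(lines)
```

Blocking the send while `missing_fields()` is non-empty is a cheap way to stop raw logs from being forwarded in place of a diagnosis.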

Automated investigation approach

These checks are predictable and repeated. That makes 429 investigation an excellent candidate for automation. A system like Altor can trigger this workflow as soon as a ticket arrives, run ClickHouse queries, check Linear and Stripe context, and correlate GitHub changes automatically.

Support then reviews a complete diagnosis package rather than manually assembling one from scratch. This often cuts first-diagnosis time from tens of minutes to a few minutes for recurring patterns.

What good teams communicate to customers during 429 incidents

Clear communication reduces repeat ticket noise and builds trust, especially when root cause is still being finalized.

Post-incident actions to prevent repeat 429s

Closing the ticket is not the end of the work. Strong teams run a short post-incident loop for high-impact 429 events. This includes validating whether limiter policy defaults are still appropriate for current customer traffic profiles, checking if retry guidance in public docs is explicit enough, and adding alert thresholds for abnormal throttle-rate changes by endpoint.

You should also convert investigation findings into detection rules. For example, if a key-collision bug caused broad throttling, add automated guardrails that detect unusually high customer-to-limiter-key cardinality shifts. If a deploy introduced incorrect quota interpretation, add pre-release integration tests using realistic plan transitions. These preventive controls reduce recurring support load and improve customer confidence during peak traffic periods.
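As one example of such a guardrail, the cardinality check can compare how many distinct customers map to each limiter key in a baseline window versus the current one. The sketch below assumes those per-key counts are computed elsewhere (for instance by a scheduled query); the function name, `max_ratio`, and `min_customers` are hypothetical and would need tuning against real traffic.

```python
def cardinality_shifts(baseline: dict[str, int], current: dict[str, int],
                       max_ratio: float = 2.0,
                       min_customers: int = 2) -> dict[str, tuple[int, int]]:
    """Flag limiter keys whose distinct-customer count grew sharply.

    baseline and current map limiter_key -> number of distinct customers.
    Returns {limiter_key: (baseline_count, current_count)} for flagged keys.
    """
    flagged = {}
    for key, now in current.items():
        before = baseline.get(key, 0)
        # Treat unseen keys as a baseline of 1 so new single-customer keys pass.
        if now >= min_customers and now > max_ratio * max(before, 1):
            flagged[key] = (before, now)
    return flagged
```

A key that moves from one customer to five is exactly the pattern a collision bug produces, while a shared regional key that drifts from ten to eleven stays quiet.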
