45 minutes to 2. Across 200+ tickets.

Portkey is an AI gateway platform handling billions of API requests from AI-first companies. Every support ticket (rate limit regressions, latency spikes, webhook failures, billing discrepancies) required a full engineering investigation. After deploying Altor, median investigation time dropped from 45 minutes to 2, with zero changes to existing workflows.

45→2

minutes per investigation, consistently across 200+ tickets

200+

tickets diagnosed in production since deployment

6

production systems connected: ClickHouse, Linear, Stripe, GitHub, docs, StatusPage

2 wks

from kickoff to first live investigation running on real tickets

The problem: every ticket was a 45-minute debugging session

Portkey is an AI infrastructure company. Their customers are engineers building on LLMs - Anthropic, OpenAI, Mistral, and dozens of other providers routing through the Portkey gateway. When something breaks, their customers do not file vague tickets. They report exact symptoms: "my p95 latency jumped 200ms," "my Llama 3 fallback stopped firing," "I'm getting 429s from the gateway on my Claude requests."

These tickets cannot be answered from a knowledge base. Every one of them required Portkey's team to open ClickHouse, run queries against the customer's API logs, check Linear for known bugs, look at recent GitHub deploys, and verify billing in Stripe. One ticket. Six browser tabs. 20-45 minutes. Every time.

At Portkey's scale, this was the single largest bottleneck in their support operation. Not response time. Not ticket routing. The investigation itself.

The deployment: live on real tickets in 2 weeks, playbooks tuned by week 4

  • Week 1 - Stack audit: We mapped Portkey's ClickHouse schema, Linear project structure, Stripe billing setup, and GitHub deploy cadence. Identified the top 5 ticket types by volume: rate limit issues, latency spikes, webhook failures, billing discrepancies, and model fallback failures.
  • Week 2 - Integrations live: Read-only connections established to all 6 systems. First investigations running on real tickets from Portkey's active queue. Engineering lead involved in reviewing and validating early diagnoses.
  • Weeks 3-4 - Playbooks tuned: Investigation logic refined against actual ticket patterns. By the end of week 4, 80% of ticket types had reusable investigation playbooks. Median time: 2 minutes.

A real investigation: rate limit regression

A customer reports: "My API calls are returning 429s. This started about 2 hours ago."

Altor receives the ticket and the customer's account ID. It runs the following in parallel:

  • ClickHouse - Queries 429 error rate for this customer's API calls over the last 24 hours. Finds: 12% error rate baseline, spiked to 43% at 09:14 UTC. Spike correlates to a specific endpoint.
  • Linear - Searches for open issues matching "rate limit" and the affected endpoint. Finds: LIN-482 "rate limit regression on /v1/chat" - open, priority urgent, assigned.
  • Stripe - Checks subscription tier, usage limits, and current period usage. Finds: Plan active, usage within limits. Not a billing-related rate limit.
  • GitHub - Pulls recent merges to the rate-limiting middleware. Finds: PR #891 "fix/rate-limit" - currently in review, expected merge within 3 days.

Diagnosis delivered in 94 seconds: "Known regression LIN-482 causing elevated 429s on /v1/chat since 09:14 UTC. Patch in PR #891, ETA 3 days. Workaround: reduce concurrency or add exponential backoff. No billing issue involved."
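
The parallel fan-out in this example can be sketched in a few lines of Python. This is a minimal illustration, not Altor's actual implementation: the function names (`check_clickhouse`, `check_linear`, `check_stripe`, `check_github`, `investigate`) are hypothetical, and each check returns canned data copied from the example above rather than calling a real API.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    source: str   # system that produced the finding
    summary: str  # one-line result of the check


# Hypothetical stand-ins for read-only lookups. Each one returns canned
# data from the example above; a real check would query the system here.
async def check_clickhouse(account_id: str) -> Finding:
    await asyncio.sleep(0)  # placeholder for a network round trip
    return Finding("ClickHouse", "429 rate rose from 12% to 43% at 09:14 UTC")


async def check_linear(endpoint: str) -> Finding:
    await asyncio.sleep(0)
    return Finding("Linear", f"LIN-482 open: rate limit regression on {endpoint}")


async def check_stripe(account_id: str) -> Finding:
    await asyncio.sleep(0)
    return Finding("Stripe", "plan active, usage within limits")


async def check_github(component: str) -> Finding:
    await asyncio.sleep(0)
    return Finding("GitHub", f"PR #891 touching {component} is in review")


async def investigate(account_id: str, endpoint: str) -> list[Finding]:
    # gather() runs the four checks concurrently and returns their
    # results in the same order as the arguments.
    results = await asyncio.gather(
        check_clickhouse(account_id),
        check_linear(endpoint),
        check_stripe(account_id),
        check_github("rate-limiting middleware"),
    )
    return list(results)


def format_diagnosis(findings: list[Finding]) -> str:
    # One line per system, ready to attach to the ticket.
    return "\n".join(f"{f.source}: {f.summary}" for f in findings)
```

Calling `asyncio.run(investigate("acct_example", "/v1/chat"))` returns one finding per system. Because the checks run concurrently, total wall-clock time is bounded by the slowest lookup rather than the sum of all four.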

"Altor diagnosed in 2 minutes what used to take our engineers 45 minutes of copying data between tabs. Our tickets are investigations, not FAQs - nobody else could even attempt to answer them automatically. Altor can because it queries our actual production data."

— Engineering Lead, Portkey

The result: investigation time eliminated as a bottleneck

After 200+ tickets diagnosed across all major ticket types, the investigation phase effectively stopped being a bottleneck. Support agents receive a structured diagnosis before they finish reading the ticket. Engineers are no longer pulled in for routine investigations. Escalations dropped.

The investigation logic also became more accurate over time. Early playbooks covered the top 3 ticket types. After 200+ tickets, 80% of all ticket types had reusable investigation logic - including edge cases that would have been hard to anticipate at the outset.

  • 2 min median investigation time (down from 45 min)
  • 200+ tickets diagnosed in production
  • 6 production systems connected and available to each investigation
  • 80% of ticket types covered by reusable investigation playbooks
  • Zero changes to existing support workflows or tooling

Your stack looks like Portkey's. See what Altor finds.

We will connect to your systems during the demo and run a live investigation on a ticket from your queue. Your data, your stack, diagnosed in real time.