How We Deployed a Production AI Investigation Engine in 3 Weeks

May 24, 2026 · Altor · 9 min read

Direct Answer

We deployed Portkey's production AI investigation engine by starting with the exact systems support engineers already used, not a separate AI sandbox. Week 1 mapped six systems and three investigation paths. Week 2 put read-only integrations live and ran the first real tickets. Week 3 pushed playbook tuning, confidence scoring, and path-specific refinements using live outcomes. Portkey reached first production investigations in 14 days, cut median investigation time from 45 minutes to 2 minutes, and reused about 80% of the logic after 200 tickets because the engine learned from repeat patterns instead of answering from static docs.

Most AI support projects start from documents. This one started from work. Portkey's support engineers were already doing investigations across ClickHouse, Linear, Stripe, GitHub, Pylon, and Statuspage. The problem was not lack of knowledge. The problem was time. Every serious ticket required a human to open six tabs, build a timeline, check account state, compare incidents, and decide whether the failure came from billing, deploys, or customer usage.

That is why the system was built as an investigation engine, not a chatbot. A chatbot answers from text it already has. An investigation engine queries live systems, joins evidence, and produces a diagnosis for one specific ticket right now.

Portkey Deployment Facts

Week 1: stack audit and path design

The first week was not prompt writing. It was workflow mapping. We sat with the support and engineering teams and traced what happened from the moment a ticket entered Pylon to the moment a human sent the final answer. Every step that took human time became a candidate system query or rule.

The six-system map looked like this. ClickHouse held request volume, latency, and error events. Linear held bug tickets and priority state. Stripe told us whether the account was blocked by billing state or plan changes. GitHub showed deployment timing and rollback history. Pylon held the customer ticket narrative. Statuspage showed whether the reported issue matched a live incident window.

From that map we defined three canonical investigation paths because those paths covered the largest share of repeated work.

Path | Primary systems | Key question
API errors | ClickHouse, GitHub, Linear | Did a deploy, known bug, or request pattern create the error burst?
Billing escalations | Stripe, Pylon, ClickHouse | Is access or quota behavior explained by account state, usage, or plan limits?
Webhook failures | ClickHouse, GitHub, Statuspage, Linear | Is the failure tied to latency, retries, deploy regressions, or an incident window?
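One way to picture the three paths is as plain data that the engine can look up per ticket type. The sketch below is illustrative only: the class name `InvestigationPath`, the `PATHS` registry, the path ids, and `systems_for` are hypothetical names, not Altor's actual schema. The systems and key questions are the ones listed above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class InvestigationPath:
    """One canonical path: which systems it reads and what it tries to answer."""
    name: str
    systems: Tuple[str, ...]
    key_question: str


# Hypothetical registry keyed by a short path id (the ids are illustrative).
PATHS: Dict[str, InvestigationPath] = {
    "api_errors": InvestigationPath(
        name="API errors",
        systems=("ClickHouse", "GitHub", "Linear"),
        key_question="Did a deploy, known bug, or request pattern create the error burst?",
    ),
    "billing_escalations": InvestigationPath(
        name="Billing escalations",
        systems=("Stripe", "Pylon", "ClickHouse"),
        key_question="Is access or quota behavior explained by account state, usage, or plan limits?",
    ),
    "webhook_failures": InvestigationPath(
        name="Webhook failures",
        systems=("ClickHouse", "GitHub", "Statuspage", "Linear"),
        key_question="Is the failure tied to latency, retries, deploy regressions, or an incident window?",
    ),
}


def systems_for(path_id: str) -> Tuple[str, ...]:
    """Return the systems a path reads, in the order listed for that path."""
    return PATHS[path_id].systems
```

Keeping the paths as data rather than prose makes it easy to see at a glance which systems a ticket type will touch.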

The audit also set the access model. Every integration was read-only. The engine could query data, correlate evidence, and return a diagnosis, but it could not modify tickets, billing state, code, or incident records. That matters in week one because read-only access removes the approval bottleneck: nothing the engine does can change a production record. Humans stay in control while the system proves accuracy on live work.
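A read-only rule like this can also be enforced in the connector layer. The following is a minimal sketch, assuming every integration goes through one wrapper. `ReadOnlyConnector`, `ReadOnlyViolation`, the verb list, and the `fake_stripe` stand-in are all hypothetical and do not reflect any real client library.

```python
from typing import Any, Callable, Dict

# Assumed set of verbs that count as reads; anything else is refused.
READ_VERBS = frozenset({"get", "list", "search", "query"})


class ReadOnlyViolation(Exception):
    """Raised when a caller attempts anything other than a read."""


class ReadOnlyConnector:
    """Wraps a system-specific handler and lets only read verbs through."""

    def __init__(self, system: str, handler: Callable[[str, str, Dict[str, Any]], Any]):
        self.system = system
        self._handler = handler

    def call(self, verb: str, resource: str, **params: Any) -> Any:
        if verb not in READ_VERBS:
            raise ReadOnlyViolation(f"{self.system}: '{verb}' is not a read operation")
        return self._handler(verb, resource, params)


def fake_stripe(verb: str, resource: str, params: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a real API client; returns canned data for illustration.
    return {"resource": resource, "status": "active"}


stripe = ReadOnlyConnector("Stripe", fake_stripe)
```

Calling `stripe.call("get", "subscription")` returns the canned record, while `stripe.call("update", "subscription")` raises `ReadOnlyViolation`. A guard like this would only be a second layer. The stronger guarantee usually comes from issuing read-scoped credentials in each source system.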

"The reason this moved fast is that we did not ask the team to change its workflow. We copied the workflow, instrumented it, and let the system run the read path first."
— Altor deployment note from the Portkey engagement, 2026

Week 2: read-only integrations live

By the second week, the integrations were querying live systems and the engine was running on actual support tickets. That matters more than any offline benchmark. The goal was not to achieve a perfect demo score. The goal was to see what broke on real tickets and where the playbooks needed structure.

The first investigations showed three things immediately. First, many tickets were under-specified in natural language but over-specified in telemetry. A short customer note like "getting 429s" became actionable once the engine pulled the account, request burst, deploy window, and known regressions. Second, false positives were concentrated in cases where two paths looked similar at the start, especially rate limits versus general latency. Third, the ticket system alone did not contain enough information to support a confident diagnosis.

Initial false positives came from one common pattern: the model would over-index on the most recent incident note or the most recent deployment even when billing or account limits were the better explanation. The fix was not a better paragraph. The fix was path-specific ordering. Billing checks had to happen before some classes of deployment reasoning. Known bug links had to be validated against the account's actual feature flags.

That is the difference between an AI investigation engine and a help assistant. The engine gets better when the query order gets better. It improves through evidence selection and path constraints, not through sounding more polished.
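The ordering fix can be sketched as a short decision function. This is a simplified illustration, not the production playbook: the evidence keys (`billing_blocked`, `over_quota`, `feature_flags`, `known_bugs`, `deploy_in_window`) and the returned labels are assumed names, and the real ordering applied only to some classes of deployment reasoning.

```python
from typing import Any, Dict


def diagnose(evidence: Dict[str, Any]) -> str:
    """Evaluate explanations in a fixed order and return the first that holds."""
    account = evidence["account"]

    # 1. Billing and account limits are checked before any deploy reasoning.
    if account["billing_blocked"] or account["over_quota"]:
        return "account_limit"

    # 2. A known bug counts only if the account has the affected feature flag on.
    for bug in evidence["known_bugs"]:
        if bug["feature_flag"] in account["feature_flags"]:
            return "known_bug:" + bug["id"]

    # 3. Only then is a recent deploy treated as the likely cause.
    if evidence["deploy_in_window"]:
        return "deploy_suspected"

    return "undetermined"
```

With this order, a ticket from an over-quota account is attributed to the account limit even when a deploy landed in the same window, which is the false positive described above.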

Week 3 and week 4: playbook refinement on live tickets

Production was live by day 14, but the third week and early fourth week were where the playbooks became durable. We split the noisy cases into distinct branches instead of one generic failure path. Three branches mattered most.

Rate limit regression path

This path checked request bursts, account plan state, token patterns, and any recent policy changes. It needed to separate customer misuse from platform regressions. The engine learned to compare the ticket window against historical request shape and recent quota changes before raising a deploy suspicion.

Latency spike path

Latency spikes looked similar to rate limit events at first because both surfaced as failed requests. The refined path prioritized percentile latency, service-level incident markers, and deploy timing before it considered quota or account state. That reduced path confusion and improved diagnosis quality on broad service degradation events.

Webhook failure path

Webhook failures needed a different sequence. Delivery attempts, endpoint behavior, retries, and recent release changes all mattered. The playbook queried event-level traces first, then cross-checked deploy history and active incident windows, then searched Linear for matching regressions. This path ended up with the clearest evidence chains because it tied one failed event to one specific source record.
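The webhook sequence can be expressed as an ordered list of query steps. The sketch below assumes each step is a function that reads the evidence gathered so far. `run_playbook`, `WEBHOOK_PLAYBOOK`, the step names, and the canned lambda results are hypothetical placeholders for real system queries.

```python
from typing import Any, Callable, Dict, List, Tuple

Evidence = Dict[str, Any]
Step = Tuple[str, Callable[[Evidence], Any]]


def run_playbook(steps: List[Step], ticket: Dict[str, Any]) -> Tuple[Evidence, List[str]]:
    """Run query steps in order; each step can see all earlier results."""
    evidence: Evidence = {"ticket": ticket}
    order: List[str] = []
    for name, query in steps:
        evidence[name] = query(evidence)
        order.append(name)
    return evidence, order


# Order follows the text: event traces, then deploys and incidents, then Linear.
# Each lambda is a stub standing in for a real read-only query.
WEBHOOK_PLAYBOOK: List[Step] = [
    ("event_traces", lambda ev: [{"event_id": ev["ticket"]["event_id"], "attempts": 3}]),
    ("deploy_history", lambda ev: []),
    ("incident_windows", lambda ev: []),
    ("linear_regressions", lambda ev: []),
]
```

Because every step receives the accumulated evidence, a later query such as the Linear search could be narrowed by what the event traces returned.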

Confidence scoring came from evidence depth, not model tone. A diagnosis scored higher when the engine had matching signals across several systems: for example, a GitHub deploy inside the same window as a ClickHouse error burst and a linked Linear bug. A diagnosis scored lower when only one signal existed or when several paths remained plausible. This let support teams know when to trust the draft and when to escalate fast.
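A scoring rule of this kind can be illustrated with a small function. The formula here is an assumption made for the sketch, not the scoring Altor shipped: it rewards the number of systems with matching signals, capped at a hypothetical `full_depth` of three, and divides by the number of paths still plausible.

```python
from typing import Set


def confidence(matching_systems: Set[str], plausible_paths: int, full_depth: int = 3) -> float:
    """Score rises with corroborating systems and falls as more paths stay plausible."""
    if not matching_systems or plausible_paths < 1:
        return 0.0
    depth = min(len(matching_systems), full_depth) / full_depth
    return round(depth / plausible_paths, 2)


strong = confidence({"GitHub", "ClickHouse", "Linear"}, plausible_paths=1)
weak = confidence({"ClickHouse"}, plausible_paths=1)
ambiguous = confidence({"ClickHouse", "GitHub"}, plausible_paths=2)
```

Under these assumptions the three-system match scores 1.0, while the single-signal case and the two-path case both score 0.33, so either would be flagged for faster escalation.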

What changed after 200 tickets

After roughly 200 investigations, about 80% of the logic was reusable. That does not mean every future ticket looked the same. It means most of the expensive work was now encoded: which systems to query, in what order, what evidence patterns mattered, and what uncertainty looked like. The remaining 20% was where product changes, new edge cases, and new support motions introduced fresh branches.

This is why the system keeps improving after deployment. Each real ticket either confirms an existing playbook or forces a new branch. The result is a tighter investigation graph over time. Human engineers no longer spend their time rebuilding the same diagnosis from scratch. They review structured findings and handle the uncommon cases.

Why this is different from a chatbot

A chatbot answers from documents and prior text. That is useful for help-center questions. It is weak for support diagnosis because the answer often lives in systems, not documents. To explain one failed request, you need current usage data, the exact account state, the deployment window, the bug tracker, and the incident timeline. That data changes hourly.

An AI investigation engine is built for those live joins. It queries systems, forms a hypothesis, tests it against more evidence, and returns a structured result. The structure matters: root cause, supporting evidence, confidence, next action, and missing context if the ticket still needs a human.
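The five-part structure maps naturally onto a small record type. This is a minimal sketch, assuming Python dataclasses. The field names mirror the list above, while `needs_human` and its 0.7 threshold are hypothetical additions for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class InvestigationResult:
    root_cause: str
    supporting_evidence: List[str]
    confidence: float
    next_action: str
    missing_context: List[str] = field(default_factory=list)

    def needs_human(self, threshold: float = 0.7) -> bool:
        """Flag for review when confidence is low or context is missing."""
        return self.confidence < threshold or bool(self.missing_context)


# Example values are invented for illustration only.
result = InvestigationResult(
    root_cause="Error burst follows a deploy and matches a linked bug",
    supporting_evidence=[
        "GitHub: deploy inside the ticket window",
        "ClickHouse: error burst in the same window",
        "Linear: linked bug",
    ],
    confidence=0.9,
    next_action="Send the draft reply and link the bug",
)
```

`asdict(result)` turns the record into a plain dictionary, which is one convenient way to attach the finding to a ticket or log it for later playbook tuning.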

See the full support investigation case study, how the system fits into the Altor platform, and why the workflow is common among API-first developer tools companies.

Frequently Asked Questions

How do you build an AI investigation engine?

You start by mapping the live systems humans already check during an investigation, then connect those systems with read-only access, define canonical paths, and tune the playbooks against real tickets. The key difference from a chatbot is that the engine queries production data on each case instead of answering from static documents.

What systems does Altor connect to?

In the Portkey deployment, Altor connected ClickHouse for usage and error data, Linear for bugs and priorities, Stripe for billing state, GitHub for deployments, Pylon for ticket context, and Statuspage for incident state. The stack can expand as long as the source has an API or query layer.

How long does it take to deploy?

Portkey reached first live production investigations in 14 days, with the first three weeks covering audit, integration, and workflow tuning. That is fast because the work starts from an existing ticket workflow and live systems, not from a blank platform build.

What does read-only by default mean?

Read-only means the system can query logs, bug trackers, billing state, deployment history, and ticket data without changing records or taking actions in those systems. The first phase is diagnosis only. Human operators stay in control of escalations and customer responses.

How do AI investigation playbooks work?

A playbook is a repeatable investigation path for a ticket type. For example, an API error playbook checks request traces, recent deployments, known bugs, and account state in a set order. Each resolved ticket sharpens that path so future investigations get faster and more precise.

If your support team already checks several systems per ticket, book a 30-minute scoping call. We'll map the investigation paths and tell you whether a read-only deployment can reach production in weeks, not quarters.
