What is an AI agent interview?

An AI agent interview is a technical assessment where candidates use AI coding tools — Cursor, Claude Code, GitHub Copilot — during the interview itself. Interviewers score engineering judgment, prompt quality, verification discipline, and ownership of AI-generated output — not memorized syntax or typing speed.

How do you evaluate token efficiency in a technical interview?

Token efficiency is measured by observing how few prompts and messages a candidate needs to accomplish a defined task. Strong candidates write precise, context-rich prompts that get usable output in 1–2 turns. Weak candidates issue vague mega-prompts, accept hallucinated output, or iterate 8–10 times on a task that should take 2. You can review Claude Code session transcripts or Cursor history after the session to score this systematically.

What does a 5/5 AI agent interview candidate look like?

A 5/5 candidate reads the codebase before prompting, writes a scope contract before delegating to the agent, reviews every line of the generated diff, writes a failing test before accepting a fix, catches model hallucinations within two minutes, and can defend every line under questioning without the AI present. They use plan mode in Claude Code, break large tasks into atomic subtasks, and know when NOT to use AI.

What are the red flags in an AI-agent technical interview?

Key red flags: (1) Pasting the entire problem description as a single prompt without scoping first. (2) Accepting generated code without reading it. (3) Unable to explain any line in the diff when asked. (4) No test written or run after the AI produces code. (5) Treating model confidence as correctness. (6) Prompting the same vague request 5+ times hoping for a different result. (7) No understanding of why the AI chose the approach it chose.

Should you allow AI tools in technical interviews?

Yes. 91% of US engineers use agentic AI tools daily. Banning them tests a fictional version of the job. Companies including Meta, Google, Canva, Shopify, and Rippling already allow, or even require, AI tools in at least one interview round. The evaluation shifts from 'can they write code?' to 'can they steer AI toward a trustworthy result, verify the output, and own what ships?'

How is an AI agent interview different from a vibe coding interview?

Vibe coding is accepting AI output without verification — shipping whatever the model produces. An AI agent interview specifically tests the opposite: whether the candidate verifies, critiques, and owns what the AI generates. The interview rewards the 'trust but verify' reflex, not the vibe of accepting whatever Claude says.

What is the typical pricing for interview-as-a-service?

Traditional interview-as-a-service providers like Karat charge $200–$450 per interview, require volume commitments, and do not evaluate AI agent proficiency. Altor's AI agent interview service is designed specifically to assess AI-native engineering fluency — the gap the existing market has completely missed.

How to Interview Engineers on AI Agent Proficiency (2026 Complete Guide)

Quick Answer

An AI agent interview evaluates how well a candidate uses tools like Cursor, Claude Code, or GitHub Copilot — not whether they can write code without them. You score on four dimensions: verification (do they catch model errors?), prompt quality (do they get usable output in 1–2 turns?), orchestration judgment (do they break tasks correctly?), and ownership (can they defend every line?). The format is live, screen-shared, real repo, AI tools allowed.

Why the Hiring Signal Has Moved

91% of US engineers use agentic AI coding tools daily (CodeSignal, 2026)
75% have shipped production code that was partially AI-generated in the last 6 months
71% of engineering leaders say AI has made technical skills harder to assess
25% of US employers now explicitly permit AI during interviews, heading to 50% within 12 months

In 2022, a technical interview measured something real: could this person write correct code without help? That question made sense when the job was to write correct code without help.

In 2026, that is not the job. The job is to steer AI agents toward a correct result, verify what they produce, reject what's wrong, and ship code you can defend line by line. Closed-book LeetCode selects for the wrong skill while actively screening out the candidate's actual workflow.

The companies that figured this out first — Meta, Google, Canva, Shopify, Sierra, Rippling — restructured at least one interview round around AI-assisted work. They didn't do it to be generous to candidates. They did it because the alternative is hiring the wrong person.

The take-home problem: 45% of US employers still send take-home assessments, but trust is gone. Nobody can tell if the candidate wrote the code or pasted the ticket into Claude Code at midnight. Live coding with a screen share and AI explicitly allowed solves this — you see how they work, not just what they produce.

What to Actually Measure

The evaluation target shifts entirely when you allow AI tools. You're no longer asking "can this person write code?" You're asking: can this person produce a trustworthy result using AI, and do they know when to trust it?

This breaks into five observable behaviors:

Scoping before prompting. Strong candidates read the codebase, understand the task, and write a mental (or literal) spec before they type a single prompt. Weak candidates paste the problem statement directly into Claude Code and accept whatever comes back.
Prompt precision. The quality of the prompt predicts the quality of the output. A precise prompt with constraints, context, and explicit scope gets a usable result in one or two turns. A vague prompt produces vague output that requires 8 more turns to become usable — and usually still has bugs.
Verification discipline. Every model produces wrong output sometimes. The question is whether the candidate notices. Do they read the diff? Do they run a test? Do they ask "why did it choose this approach?" Strong candidates write a failing test before fixing anything. They treat model confidence as a hypothesis, not a fact.
Orchestration judgment. Can they decompose a task correctly — breaking it into atomic subtasks the agent can execute well? Do they know when to keep the agent in the loop versus when to take the wheel? Can they fan out parallel agents when appropriate and stay sequential when the task has dependencies?
Ownership. This is the final gate. Remove the AI. Ask them to explain any line in the diff. A strong candidate can. A weak one says "the AI wrote that part." Ownership doesn't mean writing every line — it means understanding every line you're merging.

The Scoring Rubric (4 Dimensions)

This is the rubric Altor uses when conducting AI agent interviews on behalf of engineering teams. It's adapted from rubrics used in the AI-enabled interview rounds at Meta and Canva, refined for evaluating tool-specific fluency.

Dimension	Weight	1–2 (Failing)	3 (Passing)	4–5 (Strong)
Verification	40%	Accepts model output without reading it. No tests run. Ships hallucinated code.	Reads the diff. Runs existing tests. Catches obvious errors.	Writes a failing test before fixing. Questions confidently wrong output. Checks edge cases the model missed. Asks "what would break this?"
Prompt Judgment	25%	Pastes full problem as one prompt. Vague requests. 8+ turns for a simple task. Never provides context about the codebase.	Gives useful context. Gets to a workable result in 3–5 turns. Iterates reasonably.	Precise, atomic prompts. Explicit scope and constraints. Uses plan mode before implementing. Gets usable result in 1–2 turns. Knows when to stop using AI.
Ownership	20%	Can't explain lines in the diff. Credits the AI, not themselves. "I think the AI did something with that."	Can explain the overall approach. Fuzzy on some details.	Can explain every line under questioning. Can rewrite any section without AI if pressed. Diff is theirs, not the model's.
Orchestration	15%	Treats agent as autocomplete. Single giant prompt for complex tasks. No task decomposition. Never uses multi-step planning.	Breaks tasks into 2–3 steps. Some awareness of sequential vs. parallel.	Decomposes complex work into atomic subtasks. Knows when to fan out vs. stay sequential. Uses agent checkpoints. Reviews intermediate output before proceeding.

On weighting: Verification carries 40% for a reason. An engineer who can't catch model errors is a production liability regardless of how fast they prompt. The skill that keeps AI-native teams safe is the "trust but verify" reflex — and it's the hardest to fake in a live session.
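To make the weighting concrete, here is a minimal sketch of how the four dimension scores could be combined into a single number. The weights come from the rubric table; the function name, the dictionary keys, and the two sample candidates are illustrative assumptions, not part of any published scoring tool.

```python
# Weights are taken from the rubric table; everything else is illustrative.
WEIGHTS = {
    "verification": 0.40,
    "prompt_judgment": 0.25,
    "ownership": 0.20,
    "orchestration": 0.15,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (1 to 5) into one weighted score on the same scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 2)


# Hypothetical candidates: a fast prompter with weak verification,
# and a slower candidate who verifies carefully.
fast_but_careless = {"verification": 2, "prompt_judgment": 5, "ownership": 3, "orchestration": 4}
slow_but_careful = {"verification": 5, "prompt_judgment": 3, "ownership": 4, "orchestration": 3}

print(weighted_score(fast_but_careless))
print(weighted_score(slow_but_careful))
```

With these sample scores, the careful verifier comes out ahead of the fast prompter even though the latter has a perfect prompt score, which is the effect the 40% weight is meant to produce.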

Three Interview Formats That Work

Format 1 — 60 min

Live Agent Session: Real Repo, Real Bug

Give the candidate a mid-sized real repo. Plant one or two bugs — a race condition, a swallowed error, an off-by-one in pagination. Allow any AI tool. Observe: do they read the codebase first? Write a spec? Review the diff? Run tests? Ask them to walk you through their decisions after. The transcript of their AI session is an artifact you review together.
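As an illustration of what a planted bug and a "failing test first" response might look like, here is a small sketch of the off-by-one pagination case. The function names and the 1-indexed page convention are assumptions made for this example; a real exercise would hide the bug inside an existing repo.

```python
def paginate_buggy(items: list, page: int, page_size: int) -> list:
    """Hypothetical planted bug: pages are 1-indexed, but the offset treats them as 0-indexed."""
    start = page * page_size  # off-by-one: page 1 skips the first page_size items
    return items[start:start + page_size]


def paginate_fixed(items: list, page: int, page_size: int) -> list:
    """Corrected version: page 1 starts at offset 0."""
    start = (page - 1) * page_size
    return items[start:start + page_size]


def first_page_is_correct(paginate) -> bool:
    """The check a candidate would write first; it should fail against the buggy code."""
    return paginate(list(range(10)), page=1, page_size=3) == [0, 1, 2]


print(first_page_is_correct(paginate_buggy))  # False: the check fails and exposes the bug
print(first_page_is_correct(paginate_fixed))  # True: the same check passes after the fix
```

The signal to watch for is ordering: the check is written and seen to fail before any fix from the agent is accepted.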

Format 2 — 30 min

PR Review: Agent-Generated Code With Hidden Defects

Hand the candidate a 200–300 line PR that an AI generated. Three changes are subtly wrong: a fabricated import, a missing null check, a logic inversion. Can they find all three in 20 minutes? This tests both senior code-review skill and the "trust but verify" reflex simultaneously. No AI tool is needed; this is pure judgment.

Format 3 — 45 min

Spec-First Build: Write the Prompt Contract First

Give a small feature spec. Ask the candidate to write out their agent prompt contract before touching any code — what scope, what constraints, what they'll verify. Then let them build it. You score the prompt contract as much as the result. Weak candidates write "build me X." Strong candidates write a precise brief with edge cases and exit criteria spelled out.
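One way to picture a prompt contract is as a small structured brief. The sketch below is an assumption about what such a contract could contain, based on the elements named above (scope, constraints, verification, exit criteria); the class name, field names, and the sample task are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PromptContract:
    """Hypothetical structure for a brief written before delegating to an agent."""
    task: str
    in_scope: list[str]
    out_of_scope: list[str]
    constraints: list[str]
    verification: list[str]  # what the candidate will check before accepting output
    exit_criteria: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the contract as plain text that can be pasted in as the first prompt."""
        sections = [
            ("In scope", self.in_scope),
            ("Out of scope", self.out_of_scope),
            ("Constraints", self.constraints),
            ("Verification", self.verification),
            ("Exit criteria", self.exit_criteria),
        ]
        lines = [f"Task: {self.task}"]
        for title, entries in sections:
            if entries:
                lines.append(f"{title}:")
                lines.extend(f"- {entry}" for entry in entries)
        return "\n".join(lines)


contract = PromptContract(
    task="Add cursor-based pagination to the orders list endpoint",
    in_scope=["orders list handler", "its unit tests"],
    out_of_scope=["database schema", "other endpoints"],
    constraints=["no new dependencies", "keep the response shape backward compatible"],
    verification=["write a failing test for the empty-page case first", "read the full diff"],
    exit_criteria=["all tests pass", "every changed line can be explained"],
)
print(contract.render())
```

A "build me X" prompt would fill in only the task field; the empty sections are what the interviewer scores.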

Combining formats: Formats 1 and 2 together take 90 minutes and give you signal on both generation and verification, the two halves of AI-native engineering. Format 3 alone is the fastest signal on prompt quality. All three together give the full picture.

Red Flags and Green Flags

✓ Green Flags

Opens the repo, reads CLAUDE.md or AGENTS.md before prompting
Uses plan mode in Claude Code before any implementation
Writes a failing test before accepting the AI's fix
"The model suggested X but I rejected it because Y"
Gets to a working result in 2 turns; doesn't iterate endlessly
Asks "what would break this?" after the AI produces output
Can walk through any line in the diff from memory
Explicitly scopes what's out-of-scope before prompting
Pushes back when the AI is confidently wrong
Knows when to switch from AI to manual — and does

✗ Red Flags

Pastes the full problem description as the first prompt
Accepts generated code without reading it
"I think the AI handled that" under questioning
No test written or run after AI produces output
10+ turns to complete a task that should take 2
Treats model confidence as proof of correctness
Ships a solution they can't explain
Doesn't notice a fabricated import in a 50-line diff
Prompts the same vague thing multiple times hoping for better
Never asked the AI to explain its own reasoning

Token Efficiency as a Hiring Signal

This is the dimension that nobody else is measuring — and it's one of the strongest predictors of real-world productivity.

Token efficiency means: how many prompts and messages does a candidate need to accomplish a defined task? Every additional turn is a signal. A candidate who gets to a working, verified result in 2 prompts thinks more clearly than one who needs 12. The difference is not typing speed. It's cognitive clarity about what they want before they ask for it.

You measure it by reviewing the AI session transcript after the interview. Claude Code keeps session logs. Cursor Composer shows the full history. Copilot Chat has a session view. Three metrics matter:

Turns to first usable output: Fewer is better. A precise prompt gets usable output in one turn. A vague prompt requires 3–5 clarifying back-and-forth turns before the model produces something worth reading.
Rejection rate: What percentage of model suggestions did they reject, and why? Zero rejections is a red flag — it means they accepted everything. Rejecting one in four suggestions is healthy. Always rejecting suggests they don't know how to prompt well. The ratio matters less than the reason for each rejection.
Context window management: Did they provide sufficient context upfront, or did they start a new conversation mid-task because the model "forgot" what they were doing? Strong candidates front-load context. Weak candidates restart constantly.

The transcript review method: Ask the candidate to share their Claude Code or Cursor session history after the interview. Read it the way you'd read a PR. The prompts are the spec. The model responses are the first draft. The candidate's acceptance/rejection decisions are the review. You learn more from 5 minutes of transcript analysis than from 30 minutes of behavioral questions.
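Two of the three metrics can be computed mechanically once a reviewer has labeled the transcript. The sketch below assumes a hand-labeled list of turns; the `Turn` schema and the sample session are hypothetical and do not correspond to any export format from Claude Code, Cursor, or Copilot. Context window management is left to reviewer judgment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    """One prompt/response pair, hand-labeled by the reviewer (hypothetical schema)."""
    prompt: str
    usable: bool                     # reviewer judged the response worth reading
    accepted: Optional[bool] = None  # None when the response was not a code suggestion
    reject_reason: str = ""


def turns_to_first_usable(turns: list[Turn]) -> Optional[int]:
    """1-based index of the first turn with usable output, or None if there was none."""
    for index, turn in enumerate(turns, start=1):
        if turn.usable:
            return index
    return None


def rejection_rate(turns: list[Turn]) -> Optional[float]:
    """Share of code suggestions the candidate rejected, or None if there were none."""
    decisions = [turn.accepted for turn in turns if turn.accepted is not None]
    if not decisions:
        return None
    return decisions.count(False) / len(decisions)


session = [
    Turn("Read src/orders and summarize how pagination works", usable=True),
    Turn("Fix the offset bug in paginate(); change nothing else", usable=True, accepted=False,
         reject_reason="fix also renamed a public parameter"),
    Turn("Same fix, keep the signature unchanged", usable=True, accepted=True),
    Turn("Add a test for the last partial page", usable=True, accepted=True),
    Turn("Add a test for page_size=0", usable=True, accepted=True),
]

print(turns_to_first_usable(session))
print(rejection_rate(session))
```

The numbers only open the conversation. The `reject_reason` field is there because the reason behind each rejection carries more signal than the ratio.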

How to Review AI Session Transcripts

The session transcript is the most underutilized interview artifact. Here's how to read it:

What to look for in Claude Code transcripts

Plan mode usage: Did they enter plan mode before implementing? Using plan mode before touching code is the single strongest signal of disciplined AI-native engineering. It shows they think about scope before execution.
Tool call quality: Did they use targeted tools (Read specific files, Grep for patterns) or did they ask the agent to "look at everything"? Targeted tool use means lower token waste and faster results.
Rejection messages: When the agent proposed something wrong, what did the candidate say? "No, that's wrong because the race condition occurs at the point of write, not read" is a 5/5 response. "Try again" is a 2/5.
Checkpoint behavior: Did they stop and review at natural breakpoints, or did they let the agent run continuously and review at the end? Mid-task verification is better — errors compound when caught late.

What to look for in Cursor transcripts

Composer vs. Chat usage: Composer handles multi-file changes. Chat handles single-file edits and questions. Appropriate tool selection for the task type is a signal.
Diff acceptance rate: Accepting all diffs without inspection is a red flag. A strong candidate inspects, selects, and sometimes modifies before accepting.
Prompt length distribution: A few long, precise prompts is better than many short, vague ones. The shape of the conversation tells you how the candidate thinks.
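The prompt length distribution can be summarized with a few lines of code once the prompts are copied out of the history. This is a rough sketch: word count is a crude proxy for precision, and the `short_threshold` cutoff of 8 words is an arbitrary assumption for illustration, not a calibrated value.

```python
from statistics import median


def prompt_length_summary(prompts: list[str], short_threshold: int = 8) -> dict:
    """Summarize prompt lengths in words; short_threshold is an arbitrary illustrative cutoff."""
    lengths = [len(prompt.split()) for prompt in prompts]
    if not lengths:
        return {"count": 0, "median_words": 0, "short_share": 0.0}
    short = sum(1 for length in lengths if length < short_threshold)
    return {
        "count": len(lengths),
        "median_words": median(lengths),
        "short_share": short / len(lengths),
    }


# Hypothetical sessions for comparison.
vague_session = ["fix it", "try again", "still broken", "make it work", "try again"]
precise_session = [
    "In src/orders/pagination.py the first page skips items. Pages are 1-indexed. "
    "Fix the offset only and keep the function signature unchanged.",
    "Add a unit test for the last partial page and for an empty result set.",
]

print(prompt_length_summary(vague_session))
print(prompt_length_summary(precise_session))
```

A session made up entirely of short prompts is a cue to read the transcript more closely, not a verdict by itself; a long prompt can still be vague.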

What Meta, Google, Shopify, and Sierra Actually Do

Company	Format	AI Tool	What They Score
Meta	60-min live, replaces one traditional coding round	GPT-5, Claude Sonnet, Gemini 2.5 Pro, Llama 4	Prompt quality, verification, multi-checkpoint thematic project
Google	Human-led, AI-assisted round	Gemini	AI fluency, prompt engineering, output validation, debugging skills
Shopify	Live session, screen-share, real repo	Any	Verification reflex, judgment under pressure, ownership of output
Sierra	PR-from-a-colleague format (AI-generated draft)	Any for review	Defect detection, cross-cutting change judgment, iteration with agents
Canva	Live agent-orchestration task	Any	Task decomposition, fan-out vs. sequential judgment, code ownership
AES (YC-backed)	CLAUDE.md / AGENTS.md portfolio review + live task	Claude Code	Existing agent workflow artifacts, schema literacy, diff-reading

The pattern is consistent: allow AI tools, observe the interaction, and score judgment, not output. No company in this list is running a closed-book whiteboard in 2026. Companies still running LeetCode are having retention problems: they hired the 2022 shape of engineer at 2026 prices and are wondering why productivity is flat.

The CLAUDE.md Signal

One of the sharpest pre-interview signals has nothing to do with the interview itself. Ask candidates to share their personal CLAUDE.md or AGENTS.md file. These files encode how an engineer works with AI agents — their personal conventions, project rules, and operational expectations for AI collaborators. A well-structured CLAUDE.md is a higher-signal artifact than a portfolio, a degree, or any certification. A missing one, or one that says "be helpful", tells you everything you need to know.

Frequently Asked Questions

Should you allow AI tools in every technical interview?

Yes, for most engineering roles in 2026. The exception is when you specifically need to see unaided reasoning — for example, a system design discussion where you want to probe first-principles thinking without scaffolding. But for any role where the daily workflow involves shipping with AI in the loop, testing without AI tests a fiction. Reserve AI-free segments for specific moments, not entire rounds.

How do you prevent cheating if AI is allowed?

Use live screen-sharing with a mandatory camera. The interview isn't about whether the candidate uses AI; it's about how they use it. You're watching the session in real time. After the session, you ask them to explain every decision. Because AI use is allowed, cheating would have to mean hidden real-time help from another person, and that is hard to hide under live questioning. The ownership walkthrough ("explain this line to me") breaks hidden-assistance strategies quickly.

What role seniority does this format work for?

All of them, with different calibration. For L3/L4, you're checking baseline verification discipline and prompt fundamentals. For L5+, you're checking orchestration judgment, architectural decisions made while directing agents, and the ability to scope and decompose large problems. Senior candidates should be able to run several parallel agent tasks while reviewing the output of another, and explain every decision in each.

How long does an AI agent interview take?

A full format takes 90 minutes: a 60-minute live session plus 30 minutes of transcript review and defense. A streamlined version is 60 minutes. Unlike traditional interviews, you get an artifact (the AI session transcript) that you can study afterward, which means you don't have to make all your evaluation decisions in the room.

What's the biggest mistake companies make in AI agent interviewing?

Scoring on output quality instead of process quality. Two candidates can produce identical-looking diffs. The one who got there in 2 precise prompts with a failing test catching a model error is a completely different engineer than the one who got there in 14 turns of vague iteration with no tests and a diff they can't explain. The transcript tells the story. The diff alone does not.

Run Your AI Agent Interviews With Altor

Altor conducts live AI agent proficiency interviews on behalf of US engineering teams — so your engineers don't have to rebuild this process from scratch. We evaluate Cursor, Claude Code, and GitHub Copilot fluency using a structured rubric. You get a scored report, the session transcript, and a hire/no-hire recommendation.

Book a discovery call, or email amanda@altorlab.xyz.