An AI agent interview evaluates how well a candidate uses tools like Cursor, Claude Code, or GitHub Copilot — not whether they can write code without them. You score on four dimensions: verification (do they catch model errors?), prompt quality (do they get usable output in 1–2 turns?), orchestration judgment (do they break tasks correctly?), and ownership (can they defend every line?). The format is live, screen-shared, real repo, AI tools allowed.
Why the Hiring Signal Has Moved
In 2022, a technical interview measured something real: could this person write correct code without help? That question made sense when the job was to write correct code without help.
In 2026, that is not the job. The job is to steer AI agents toward a correct result, verify what they produce, reject what's wrong, and ship code you can defend line by line. Closed-book LeetCode selects for the wrong skill while actively screening out the candidate's actual workflow.
The companies that figured this out first — Meta, Google, Canva, Shopify, Sierra, Rippling — restructured at least one interview round around AI-assisted work. They didn't do it to be generous to candidates. They did it because the alternative is hiring the wrong person.
What to Actually Measure
The evaluation target shifts entirely when you allow AI tools. You're no longer asking "can this person write code?" You're asking: can this person produce a trustworthy result using AI, and do they know when to trust it?
This breaks into five observable behaviors:
- Scoping before prompting. Strong candidates read the codebase, understand the task, and write a mental (or literal) spec before they type a single prompt. Weak candidates paste the problem statement directly into Claude Code and accept whatever comes back.
- Prompt precision. The quality of the prompt predicts the quality of the output. A precise prompt with constraints, context, and explicit scope gets a usable result in one or two turns. A vague prompt produces vague output that requires 8 more turns to become usable — and usually still has bugs.
- Verification discipline. Every model produces wrong output sometimes. The question is whether the candidate notices. Do they read the diff? Do they run a test? Do they ask "why did it choose this approach?" Strong candidates write a failing test before fixing anything. They treat model confidence as a hypothesis, not a fact.
- Orchestration judgment. Can they decompose a task correctly — breaking it into atomic subtasks the agent can execute well? Do they know when to keep the agent in the loop versus when to take the wheel? Can they fan out parallel agents when appropriate and stay sequential when the task has dependencies?
- Ownership. This is the final gate. Remove the AI. Ask them to explain any line in the diff. A strong candidate can. A weak one says "the AI wrote that part." Ownership doesn't mean writing every line — it means understanding every line you're merging.
The Scoring Rubric (4 Dimensions)
This is the rubric Altor uses when conducting AI agent interviews on behalf of engineering teams. It's adapted from rubrics used in Meta's and Canva's AI-enabled interview rounds and refined for evaluating tool-specific fluency.
| Dimension | Weight | 1–2 (Failing) | 3 (Passing) | 4–5 (Strong) |
|---|---|---|---|---|
| Verification | 40% | Accepts model output without reading it. No tests run. Ships hallucinated code. | Reads the diff. Runs existing tests. Catches obvious errors. | Writes a failing test before fixing. Questions confidently wrong output. Checks edge cases the model missed. Asks "what would break this?" |
| Prompt Judgment | 25% | Pastes full problem as one prompt. Vague requests. 8+ turns for a simple task. Never provides context about the codebase. | Gives useful context. Gets to a workable result in 3–5 turns. Iterates reasonably. | Precise, atomic prompts. Explicit scope and constraints. Uses plan mode before implementing. Gets usable result in 1–2 turns. Knows when to stop using AI. |
| Ownership | 20% | Can't explain lines in the diff. Credits the AI, not themselves. "I think the AI did something with that." | Can explain the overall approach. Fuzzy on some details. | Can explain every line under questioning. Can rewrite any section without AI if pressed. Diff is theirs, not the model's. |
| Orchestration | 15% | Treats agent as autocomplete. Single giant prompt for complex tasks. No task decomposition. Never uses multi-step planning. | Breaks tasks into 2–3 steps. Some awareness of sequential vs. parallel. | Decomposes complex work into atomic subtasks. Knows when to fan out vs. stay sequential. Uses agent checkpoints. Reviews intermediate output before proceeding. |
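The weights above can be folded into a single number. The sketch below assumes each dimension is scored 1 to 5 and that the overall score is a plain weighted average; the source does not say how Altor aggregates scores, so `weighted_score` and the averaging rule are illustrative assumptions.

```python
# Weights come from the rubric table; the weighted-average rule is an assumption.
WEIGHTS = {
    "verification": 0.40,
    "prompt_judgment": 0.25,
    "ownership": 0.20,
    "orchestration": 0.15,
}


def weighted_score(scores: dict[str, int]) -> float:
    """Return the weighted average of 1-5 dimension scores, rounded to 2 places."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 2)


# Hypothetical candidate: strong ownership, weak orchestration.
example = {"verification": 4, "prompt_judgment": 3, "ownership": 5, "orchestration": 2}
print(weighted_score(example))
```

Because Verification carries 40% of the weight, a one-point change there moves the total more than a one-point change in any other dimension.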
Three Interview Formats That Work
Live Agent Session: Real Repo, Real Bug
Give the candidate a mid-sized real repo. Plant one or two bugs — a race condition, a swallowed error, an off-by-one in pagination. Allow any AI tool. Observe: do they read the codebase first? Write a spec? Review the diff? Run tests? Ask them to walk you through their decisions after. The transcript of their AI session is an artifact you review together.
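As an illustration of the kind of bug you might plant, here is a minimal sketch of an off-by-one in pagination, together with the failing test a strong candidate would write before accepting any fix. The function names and the 1-indexed page convention are assumptions made for the example, not taken from any particular repo.

```python
def paginate_buggy(items, page, page_size):
    """Planted bug: pages are 1-indexed, but the offset treats them as 0-indexed."""
    start = page * page_size
    return items[start:start + page_size]


def paginate_fixed(items, page, page_size):
    """Fix: convert the 1-indexed page number to a 0-indexed offset."""
    start = (page - 1) * page_size
    return items[start:start + page_size]


def check_first_page(paginate):
    """The test written first: page 1 must return the first page_size items."""
    items = list(range(1, 11))
    return paginate(items, page=1, page_size=3) == [1, 2, 3]


print(check_first_page(paginate_buggy))  # False: the test fails before the fix
print(check_first_page(paginate_fixed))  # True: the same test passes after it
```

What you are watching for is the order of events: the test exists and fails before the agent's fix is accepted, so the fix is verified rather than assumed.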
PR Review: Agent-Generated Code With Hidden Defects
Hand a 200–300 line PR that an AI generated. Three changes are subtly wrong — a fabricated import, a null check missing, a logic inversion. Can they find all three in 20 minutes? This tests both senior code-review skill and the "trust but verify" reflex simultaneously. No AI tool needed — this is pure judgment.
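To make the three defect types concrete, here is a small sketch of what each can look like in Python. The functions are hypothetical and exist only to illustrate the patterns; the fabricated import is left as a comment so the example still runs.

```python
from typing import Optional

# Defect type 1, fabricated import (commented out so this sketch runs):
# from json import parse_strict   # the json module has no such name


def is_expired_buggy(now: float, expires_at: float) -> bool:
    # Defect type 2, logic inversion: reports a token as expired while it is still valid.
    return now < expires_at


def is_expired(now: float, expires_at: float) -> bool:
    # Corrected comparison: expired once the current time reaches the expiry time.
    return now >= expires_at


def display_name_buggy(user: Optional[dict]) -> str:
    # Defect type 3, missing null check: raises TypeError when user is None.
    return user["name"].strip()


def display_name(user: Optional[dict]) -> str:
    # Corrected version handles the missing-user case explicitly.
    if user is None:
        return "anonymous"
    return user["name"].strip()
```

Each defect reads plausibly at a glance, which is why this format tests reading discipline rather than syntax knowledge.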
Spec-First Build: Write the Prompt Contract First
Give a small feature spec. Ask the candidate to write out their agent prompt contract before touching any code — what scope, what constraints, what they'll verify. Then let them build it. You score the prompt contract as much as the result. Weak candidates write "build me X." Strong candidates write a precise brief with edge cases and exit criteria spelled out.
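One way to make the prompt contract reviewable is to ask for it in a fixed shape. The sketch below is an assumption about what such a shape could look like: the `PromptContract` fields, the `is_vague` rule, and the example feature are all hypothetical, not a format the source prescribes.

```python
from dataclasses import dataclass


@dataclass
class PromptContract:
    goal: str
    in_scope: list[str]
    out_of_scope: list[str]
    constraints: list[str]
    verification: list[str]  # what the candidate will check before accepting output

    def is_vague(self) -> bool:
        """Flag 'build me X' contracts: no constraints or no verification steps."""
        return not (self.constraints and self.verification)

    def render(self) -> str:
        """Render the contract as plain text suitable for pasting into an agent."""
        sections = {
            "Goal": [self.goal],
            "In scope": self.in_scope,
            "Out of scope": self.out_of_scope,
            "Constraints": self.constraints,
            "Verify before accepting": self.verification,
        }
        lines = []
        for title, entries in sections.items():
            lines.append(f"{title}:")
            lines.extend(f"- {entry}" for entry in entries)
        return "\n".join(lines)


# Hypothetical feature used only for illustration.
contract = PromptContract(
    goal="Add a page_size query parameter to the list endpoint",
    in_scope=["request parsing", "handler for the list endpoint"],
    out_of_scope=["database schema", "authentication"],
    constraints=["page_size must be between 1 and 100", "default stays at 20"],
    verification=["failing test for page_size=0", "existing tests still pass"],
)
print(contract.render())
print(contract.is_vague())
```

A contract with an empty constraints or verification list is the structured equivalent of "build me X", which is what `is_vague` is meant to catch.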
Red Flags and Green Flags
✓ Green Flags
- Opens the repo, reads CLAUDE.md or AGENTS.md before prompting
- Uses plan mode in Claude Code before any implementation
- Writes a failing test before accepting the AI's fix
- "The model suggested X but I rejected it because Y"
- Gets to a working result in 2 turns; doesn't iterate endlessly
- Asks "what would break this?" after the AI produces output
- Can walk through any line in the diff from memory
- Explicitly scopes what's out-of-scope before prompting
- Pushes back when the AI is confidently wrong
- Knows when to switch from AI to manual — and does
✗ Red Flags
- Pastes the full problem description as the first prompt
- Accepts generated code without reading it
- "I think the AI handled that" under questioning
- No test written or run after AI produces output
- 10+ turns to complete a task that should take 2
- Treats model confidence as proof of correctness
- Ships a solution they can't explain
- Doesn't notice a fabricated import in a 50-line diff
- Prompts the same vague thing multiple times hoping for better
- Never asked the AI to explain its own reasoning
Token Efficiency as a Hiring Signal
This is the dimension that nobody else is measuring — and it's one of the strongest predictors of real-world productivity.
Token efficiency here is measured in turns: how many prompts and messages does a candidate need to accomplish a defined task? Every additional turn is a signal. A candidate who gets to a working, verified result in 2 prompts usually thinks more clearly than one who needs 12. The difference is not typing speed. It's clarity about what they want before they ask for it.
You measure it by reviewing the AI session transcript after the interview. Claude Code keeps session logs. Cursor Composer shows the full history. Copilot Chat has a session view. Three metrics matter:
- Turns to first usable output: Fewer is better. A precise prompt gets usable output in one turn. A vague prompt requires 3–5 clarifying back-and-forth turns before the model produces something worth reading.
- Rejection rate: What percentage of model suggestions did they reject, and why? Zero rejections is a red flag — it means they accepted everything. Rejecting one in four suggestions is healthy. Always rejecting suggests they don't know how to prompt well. The ratio matters less than the reason for each rejection.
- Context window management: Did they provide sufficient context upfront, or did they start a new conversation mid-task because the model "forgot" what they were doing? Strong candidates front-load context. Weak candidates restart constantly.
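The three metrics can be computed once a reviewer has annotated the transcript. The sketch below assumes a hand-annotated list of turns; the `Turn` fields are a hypothetical schema, not the log format of Claude Code, Cursor, or Copilot Chat.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    prompt: str
    accepted: bool             # did the candidate accept the model's suggestion?
    usable: bool               # reviewer's judgment: was the output worth reading?
    new_session: bool = False  # True if the candidate restarted the conversation


def transcript_metrics(turns: list[Turn]) -> dict:
    """Compute turns to first usable output, rejection rate, and restart count."""
    first_usable = next((i for i, t in enumerate(turns, start=1) if t.usable), None)
    rejected = sum(1 for t in turns if not t.accepted)
    return {
        "turns_to_first_usable": first_usable,
        "rejection_rate": rejected / len(turns) if turns else 0.0,
        "restarts": sum(1 for t in turns if t.new_session),
    }


# Hypothetical annotated session: four turns, one rejection, no restarts.
session = [
    Turn("Scoped brief with constraints", accepted=True, usable=True),
    Turn("Add a failing test for the empty page", accepted=True, usable=True),
    Turn("Apply the fix", accepted=False, usable=False),
    Turn("No, the offset is wrong because pages are 1-indexed", accepted=True, usable=True),
]
print(transcript_metrics(session))
```

The numbers only locate where to look; the reason behind each rejection still has to be read in the transcript itself.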
How to Review AI Session Transcripts
The session transcript is the most underutilized interview artifact. Here's how to read it:
What to look for in Claude Code transcripts
- Plan mode usage: Did they hit /plan before implementing? Using plan mode before touching code is the single strongest signal of disciplined AI-native engineering. It shows they think about scope before execution.
- Tool call quality: Did they use targeted tools (Read specific files, Grep for patterns) or did they ask the agent to "look at everything"? Targeted tool use means lower token waste and faster results.
- Rejection messages: When the agent proposed something wrong, what did the candidate say? "No, that's wrong because the race condition occurs at the point of write, not read" is a 5/5 response. "Try again" is a 2/5.
- Checkpoint behavior: Did they stop and review at natural breakpoints, or did they let the agent run continuously and review at the end? Mid-task verification is better — errors compound when caught late.
What to look for in Cursor transcripts
- Composer vs. Chat usage: Composer handles multi-file changes. Chat handles single-file edits and questions. Appropriate tool selection for the task type is a signal.
- Diff acceptance rate: Accepting all diffs without inspection is a red flag. A strong candidate inspects, selects, and sometimes modifies before accepting.
- Prompt length distribution: A few long, precise prompts is better than many short, vague ones. The shape of the conversation tells you how the candidate thinks.
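Prompt length distribution can be summarized with a few lines of code once the prompts are exported as plain strings. This is a rough sketch: word count is a crude proxy for precision, and the `short_threshold` cutoff of 8 words is an arbitrary assumption to be tuned against your own transcripts.

```python
from statistics import median


def prompt_shape(prompts: list[str], short_threshold: int = 8) -> dict:
    """Summarize prompt lengths in words; short_threshold is an arbitrary cutoff."""
    lengths = [len(p.split()) for p in prompts]
    if not lengths:
        return {"count": 0, "median_words": 0, "short_share": 0.0}
    short = sum(1 for n in lengths if n < short_threshold)
    return {
        "count": len(lengths),
        "median_words": median(lengths),
        "short_share": short / len(lengths),
    }


# Hypothetical session made entirely of short, vague prompts.
vague_session = ["fix it", "try again", "still broken", "try again"]
print(prompt_shape(vague_session))
```

A high share of short prompts is a cue to read those turns closely, not a verdict on its own: a short prompt that follows a long, precise brief can be perfectly reasonable.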
What Meta, Google, Shopify, and Sierra Actually Do
| Company | Format | AI Tool | What They Score |
|---|---|---|---|
| Meta | 60-min live, replaces one traditional coding round | GPT-5, Claude Sonnet, Gemini 2.5 Pro, Llama 4 | Prompt quality, verification, multi-checkpoint thematic project |
| Google | Human-led, AI-assisted round | Gemini | AI fluency, prompt engineering, output validation, debugging skills |
| Shopify | Live session, screen-share, real repo | Any | Verification reflex, judgment under pressure, ownership of output |
| Sierra | PR-from-a-colleague format (AI-generated draft) | Any for review | Defect detection, cross-cutting change judgment, iteration with agents |
| Canva | Live agent-orchestration task | Any | Task decomposition, fan-out vs. sequential judgment, code ownership |
| AES (YC-backed) | CLAUDE.md / AGENTS.md portfolio review + live task | Claude Code | Existing agent workflow artifacts, schema literacy, diff-reading |
The pattern is consistent: allow AI tools, observe the interaction, score judgment not output. Every company in this list runs at least one AI-assisted round in 2026 rather than relying on closed-book whiteboards alone. The ones still running LeetCode are having retention problems: they hired the 2022 shape of engineer at 2026 prices and are wondering why productivity is flat.
The CLAUDE.md Signal
One of the sharpest pre-interview signals has nothing to do with the interview itself. Ask candidates to share their personal CLAUDE.md or AGENTS.md file. These files encode how an engineer works with AI agents — their personal conventions, project rules, and operational expectations for AI collaborators. A well-structured CLAUDE.md is a higher-signal artifact than a portfolio, a degree, or any certification. A missing one, or one that says "be helpful", tells you everything you need to know.
Frequently Asked Questions
Should you allow AI tools in every technical interview?
Yes, for most engineering roles in 2026. The exception is when you specifically need to see unaided reasoning — for example, a system design discussion where you want to probe first-principles thinking without scaffolding. But for any role where the daily workflow involves shipping with AI in the loop, testing without AI tests a fiction. Reserve AI-free segments for specific moments, not entire rounds.
How do you prevent cheating if AI is allowed?
Use live screen-sharing with a mandatory camera. The interview isn't about whether the candidate uses AI; it's about how they use it. You're watching the session in real time. After the session, you ask them to explain every decision. Once AI is allowed, the only cheating left is real-time help from a hidden human co-pilot, and that is hard to conceal under live questioning. The ownership walkthrough ("explain this line to me") breaks hidden-assistance strategies quickly.
What role seniority does this format work for?
All of them, with different calibration. For L3/L4, you're checking baseline verification discipline and prompt fundamentals. For L5+, you're checking orchestration judgment, architectural decisions made while directing agents, and the ability to scope and decompose large problems. Senior candidates should be able to run four parallel agent tasks while reviewing the output of a fifth, and explain every decision in each.
How long does an AI agent interview take?
A full format takes 90 minutes: 60 minutes live session + 30 minutes transcript review and defense. A streamlined version is 60 minutes. Unlike traditional interviews, you get an artifact (the AI session transcript) that you can study after — which means you don't have to make all your evaluation decisions in the room.
What's the biggest mistake companies make in AI agent interviewing?
Scoring on output quality instead of process quality. Two candidates can produce identical-looking diffs. The one who got there in 2 precise prompts with a failing test catching a model error is a completely different engineer than the one who got there in 14 turns of vague iteration with no tests and a diff they can't explain. The transcript tells the story. The diff alone does not.
Run Your AI Agent Interviews With Altor
Altor conducts live AI agent proficiency interviews on behalf of US engineering teams — so your engineers don't have to rebuild this process from scratch. We evaluate Cursor, Claude Code, and GitHub Copilot fluency using a structured rubric. You get a scored report, the session transcript, and a hire/no-hire recommendation.