2026 Complete Guide

How to Interview Engineers on AI Agent Proficiency

The hiring signal has moved. 91% of engineers use agentic AI daily. Here's the only framework that actually measures it.

By Altor·Updated July 2026·2,800 words
Quick Answer

An AI agent interview evaluates how well a candidate uses tools like Cursor, Claude Code, or GitHub Copilot — not whether they can write code without them. You score on four dimensions: verification (do they catch model errors?), prompt quality (do they get usable output in 1–2 turns?), orchestration judgment (do they break tasks correctly?), and ownership (can they defend every line?). The format is live, screen-shared, real repo, AI tools allowed.

Why the Hiring Signal Has Moved

91%
of US engineers use agentic AI coding tools daily (CodeSignal, 2026)
75%
have shipped production code that was partially AI-generated in the last 6 months
71%
of engineering leaders say AI has made technical skills harder to assess
25%
of US employers now explicitly permit AI during interviews, heading to 50% within 12 months

In 2022, a technical interview measured something real: could this person write correct code without help? That question made sense when the job was to write correct code without help.

In 2026, that is not the job. The job is to steer AI agents toward a correct result, verify what they produce, reject what's wrong, and ship code you can defend line by line. Closed-book LeetCode selects for the wrong skill while actively screening out the candidate's actual workflow.

The companies that figured this out first — Meta, Google, Canva, Shopify, Sierra, Rippling — restructured at least one interview round around AI-assisted work. They didn't do it to be generous to candidates. They did it because the alternative is hiring the wrong person.

The take-home problem: 45% of US employers still send take-home assessments, but trust is gone. Nobody can tell if the candidate wrote the code or pasted the ticket into Claude Code at midnight. Live coding with a screen share and AI explicitly allowed solves this — you see how they work, not just what they produce.

What to Actually Measure

The evaluation target shifts entirely when you allow AI tools. You're no longer asking "can this person write code?" You're asking: can this person produce a trustworthy result using AI, and do they know when to trust it?

This breaks into five observable behaviors:

  1. Scoping before prompting. Strong candidates read the codebase, understand the task, and write a mental (or literal) spec before they type a single prompt. Weak candidates paste the problem statement directly into Claude Code and accept whatever comes back.
  2. Prompt precision. The quality of the prompt predicts the quality of the output. A precise prompt with constraints, context, and explicit scope gets a usable result in one or two turns. A vague prompt produces vague output that requires 8 more turns to become usable — and usually still has bugs.
  3. Verification discipline. Every model produces wrong output sometimes. The question is whether the candidate notices. Do they read the diff? Do they run a test? Do they ask "why did it choose this approach?" Strong candidates write a failing test before fixing anything. They treat model confidence as a hypothesis, not a fact.
  4. Orchestration judgment. Can they decompose a task correctly — breaking it into atomic subtasks the agent can execute well? Do they know when to keep the agent in the loop versus when to take the wheel? Can they fan out parallel agents when appropriate and stay sequential when the task has dependencies?
  5. Ownership. This is the final gate. Remove the AI. Ask them to explain any line in the diff. A strong candidate can. A weak one says "the AI wrote that part." Ownership doesn't mean writing every line — it means understanding every line you're merging.

The Scoring Rubric (4 Dimensions)

This is the rubric Altor uses when conducting AI agent interviews on behalf of engineering teams. It's adapted from rubrics used at Meta and Canva's AI-enabled interview rounds, refined for evaluating tool-specific fluency.

Dimension Weight 1–2 (Failing) 3 (Passing) 4–5 (Strong)
Verification 40% Accepts model output without reading it. No tests run. Ships hallucinated code. Reads the diff. Runs existing tests. Catches obvious errors. Writes a failing test before fixing. Questions confidently wrong output. Checks edge cases the model missed. Asks "what would break this?"
Prompt Judgment 25% Pastes full problem as one prompt. Vague requests. 8+ turns for a simple task. Never provides context about the codebase. Gives useful context. Gets to a workable result in 3–5 turns. Iterates reasonably. Precise, atomic prompts. Explicit scope and constraints. Uses plan mode before implementing. Gets usable result in 1–2 turns. Knows when to stop using AI.
Ownership 20% Can't explain lines in the diff. Credits the AI, not themselves. "I think the AI did something with that." Can explain the overall approach. Fuzzy on some details. Can explain every line under questioning. Can rewrite any section without AI if pressed. Diff is theirs, not the model's.
Orchestration 15% Treats agent as autocomplete. Single giant prompt for complex tasks. No task decomposition. Never uses multi-step planning. Breaks tasks into 2–3 steps. Some awareness of sequential vs. parallel. Decomposes complex work into atomic subtasks. Knows when to fan out vs. stay sequential. Uses agent checkpoints. Reviews intermediate output before proceeding.
On weighting: Verification carries 40% for a reason. An engineer who can't catch model errors is a production liability regardless of how fast they prompt. The skill that keeps AI-native teams safe is the "trust but verify" reflex — and it's the hardest to fake in a live session.

Three Interview Formats That Work

Format 1 — 60 min

Live Agent Session: Real Repo, Real Bug

Give the candidate a mid-sized real repo. Plant one or two bugs — a race condition, a swallowed error, an off-by-one in pagination. Allow any AI tool. Observe: do they read the codebase first? Write a spec? Review the diff? Run tests? Ask them to walk you through their decisions after. The transcript of their AI session is an artifact you review together.

Format 2 — 30 min

PR Review: Agent-Generated Code With Hidden Defects

Hand a 200–300 line PR that an AI generated. Three changes are subtly wrong — a fabricated import, a null check missing, a logic inversion. Can they find all three in 20 minutes? This tests both senior code-review skill and the "trust but verify" reflex simultaneously. No AI tool needed — this is pure judgment.

Format 3 — 45 min

Spec-First Build: Write the Prompt Contract First

Give a small feature spec. Ask the candidate to write out their agent prompt contract before touching any code — what scope, what constraints, what they'll verify. Then let them build it. You score the prompt contract as much as the result. Weak candidates write "build me X." Strong candidates write a precise brief with edge cases and exit criteria spelled out.

Combining formats: Format 1 + Format 2 together takes 90 minutes and gives you signal on both generation and verification — the two halves of AI-native engineering. Format 3 alone is the fastest signal on prompt quality. All three together is the full picture.

Red Flags and Green Flags

✓ Green Flags

  • Opens the repo, reads CLAUDE.md or AGENTS.md before prompting
  • Uses plan mode in Claude Code before any implementation
  • Writes a failing test before accepting the AI's fix
  • "The model suggested X but I rejected it because Y"
  • Gets to a working result in 2 turns; doesn't iterate endlessly
  • Asks "what would break this?" after the AI produces output
  • Can walk through any line in the diff from memory
  • Explicitly scopes what's out-of-scope before prompting
  • Pushes back when the AI is confidently wrong
  • Knows when to switch from AI to manual — and does

✗ Red Flags

  • Pastes the full problem description as the first prompt
  • Accepts generated code without reading it
  • "I think the AI handled that" under questioning
  • No test written or run after AI produces output
  • 10+ turns to complete a task that should take 2
  • Treats model confidence as proof of correctness
  • Ships a solution they can't explain
  • Doesn't notice a fabricated import in a 50-line diff
  • Prompts the same vague thing multiple times hoping for better
  • Never asked the AI to explain its own reasoning

Token Efficiency as a Hiring Signal

This is the dimension that nobody else is measuring — and it's one of the strongest predictors of real-world productivity.

Token efficiency means: how many prompts and messages does a candidate need to accomplish a defined task? Every additional turn is a signal. A candidate who gets to a working, verified result in 2 prompts thinks more clearly than one who needs 12. The difference is not typing speed. It's cognitive clarity about what they want before they ask for it.

You measure it by reviewing the AI session transcript after the interview. Claude Code keeps session logs. Cursor Composer shows the full history. Copilot Chat has a session view. Three metrics matter:

The transcript review method: Ask the candidate to share their Claude Code or Cursor session history after the interview. Read it the way you'd read a PR. The prompts are the spec. The model responses are the first draft. The candidate's acceptance/rejection decisions are the review. You learn more from 5 minutes of transcript analysis than from 30 minutes of behavioral questions.

How to Review AI Session Transcripts

The session transcript is the most underutilized interview artifact. Here's how to read it:

What to look for in Claude Code transcripts

What to look for in Cursor transcripts

What Meta, Google, Shopify, and Sierra Actually Do

Company Format AI Tool What They Score
Meta 60-min live, replaces one traditional coding round GPT-5, Claude Sonnet, Gemini 2.5 Pro, Llama 4 Prompt quality, verification, multi-checkpoint thematic project
Google Human-led, AI-assisted round Gemini AI fluency, prompt engineering, output validation, debugging skills
Shopify Live session, screen-share, real repo Any Verification reflex, judgment under pressure, ownership of output
Sierra PR-from-a-colleague format (AI-generated draft) Any for review Defect detection, cross-cutting change judgment, iteration with agents
Canva Live agent-orchestration task Any Task decomposition, fan-out vs. sequential judgment, code ownership
AES (YC-backed) CLAUDE.md / AGENTS.md portfolio review + live task Claude Code Existing agent workflow artifacts, schema literacy, diff-reading

The pattern is consistent: allow AI tools, observe the interaction, score judgment not output. No company in this list is running a closed-book whiteboard in 2026. The ones still running LeetCode are having retention problems — they hired the 2022 shape of engineer at 2026 prices and are wondering why productivity is flat.

The CLAUDE.md Signal

One of the sharpest pre-interview signals has nothing to do with the interview itself. Ask candidates to share their personal CLAUDE.md or AGENTS.md file. These files encode how an engineer works with AI agents — their personal conventions, project rules, and operational expectations for AI collaborators. A well-structured CLAUDE.md is a higher-signal artifact than a portfolio, a degree, or any certification. A missing one, or one that says "be helpful", tells you everything you need to know.

Frequently Asked Questions

Should you allow AI tools in every technical interview?

Yes, for most engineering roles in 2026. The exception is when you specifically need to see unaided reasoning — for example, a system design discussion where you want to probe first-principles thinking without scaffolding. But for any role where the daily workflow involves shipping with AI in the loop, testing without AI tests a fiction. Reserve AI-free segments for specific moments, not entire rounds.

How do you prevent cheating if AI is allowed?

Live screen-sharing with mandatory camera. The interview isn't about whether the candidate uses AI — it's about how they use it. You're watching the session in real time. After the session, you ask them to explain every decision. Cheating via AI requires real-time help from a human co-pilot; that's hard to hide under live questioning. The ownership walkthrough ("explain this line to me") breaks hidden-assistance strategies quickly.

What role seniority does this format work for?

All of them, with different calibration. For L3/L4, you're checking baseline verification discipline and prompt fundamentals. For L5+, you're checking orchestration judgment, architectural decisions made while directing agents, and the ability to scope and decompose large problems. Senior candidates should be able to run four or five parallel agent tasks while reviewing the output of a fifth — and explain every decision in each.

How long does an AI agent interview take?

A full format takes 90 minutes: 60 minutes live session + 30 minutes transcript review and defense. A streamlined version is 60 minutes. Unlike traditional interviews, you get an artifact (the AI session transcript) that you can study after — which means you don't have to make all your evaluation decisions in the room.

What's the biggest mistake companies make in AI agent interviewing?

Scoring on output quality instead of process quality. Two candidates can produce identical-looking diffs. The one who got there in 2 precise prompts with a failing test catching a model error is a completely different engineer than the one who got there in 14 turns of vague iteration with no tests and a diff they can't explain. The transcript tells the story. The diff alone does not.

Run Your AI Agent Interviews With Altor

Altor conducts live AI agent proficiency interviews on behalf of US engineering teams — so your engineers don't have to rebuild this process from scratch. We evaluate Cursor, Claude Code, and GitHub Copilot fluency using a structured rubric. You get a scored report, the session transcript, and a hire/no-hire recommendation.

Related reading: AI Agent Interview Service — how Altor runs it · Download the scoring rubric · AI strategy consulting vs. AI implementation