How do you score an AI agent technical interview?

Score on four weighted dimensions. Verification (40%): does the candidate catch model errors and write tests? Prompt Quality (25%): how many turns to a usable result, and how precise are the prompts? Ownership (20%): can they explain every line in the diff without the AI? Orchestration (15%): do they decompose tasks correctly and know when to use AI and when to work manually? Each dimension scores 1–5. The weighted average gives a total score: 4.0+ is a strong hire, 3.5–3.9 is a conditional hire, 3.0–3.4 requires a second opinion, and below 3.0 is a no-hire. A Verification score below 3 is a hard no-hire regardless of the total.

What is a passing score on an AI agent interview rubric?

A weighted score of 3.5 or above is the passing threshold for most mid-level (L4) engineering roles. Senior and staff roles should clear 4.0. The most important dimension is Verification — a candidate who scores below 3 on Verification is a production liability regardless of their other scores, because they will ship hallucinated output.

What should I look for when reviewing a Claude Code session transcript?

Review: (1) Did they use plan mode before implementation? (2) How many turns to first usable output? Fewer is better. (3) What percentage of model suggestions did they reject, and why? (4) Did they provide codebase context upfront, and did they avoid starting a new conversation mid-task (which loses context)? (5) When the model was wrong, did they catch it and explain why? The transcript tells you more than the final diff.

AI Agent Interview Rubric — Free Scoring Framework for Engineering Interviews (2026)

How to use this rubric

Score each dimension from 1–5 during or immediately after the interview. Multiply each score by its weight and sum for a total. A score of 4.0+ = strong hire, 3.5–3.9 = conditional hire, 3.0–3.4 = requires a second opinion, below 3.0 = no-hire. A score below 3 on Verification is a hard no-hire regardless of total — a candidate who ships hallucinated code is a production liability at any level.

Dimension 1: Verification (40%)

The single most predictive signal of AI-native engineering safety. Does the candidate verify what the model produces, or do they treat model confidence as correctness?

Verification Discipline

Does the candidate catch model errors, write tests, and question confident-but-wrong output?

5 / 5

Wrote a failing test before accepting the AI's fix. Caught a model hallucination within 2 minutes and explained exactly why it was wrong. Checked edge cases the model missed. Treated every model output as a hypothesis, not a fact. "The model suggested X but I noticed Y would break because Z."

4 / 5

Ran existing tests after accepting output. Caught the obvious error in the diff. Showed healthy skepticism — questioned at least one AI decision. Missed one subtle issue but found it during the walkthrough when asked.

3 / 5

Read the diff before accepting. Ran tests when reminded. Didn't write new tests but verified the change didn't break existing ones. Missed one non-obvious issue. Can explain the code when asked but didn't proactively check edge cases.

2 / 5

Accepted diffs without reading them. Ran no tests. Only noticed the obvious error when the interviewer pointed it out. Treated model confidence as verification. "The AI said it was right, so I assumed it was."

1 / 5

Shipped code without reading it. No tests. Couldn't explain what the change actually does. Accepted a hallucinated import. Would have merged this to production as-is.

Dimension 2: Prompt Quality (25%)

How many turns to a usable result? Fewer is almost always better. The quality of a prompt is a direct reflection of how clearly the candidate thinks about the problem before delegating.

Prompt Quality & Token Efficiency

Precision of prompts, turns to first usable output, context provided, scope discipline

5 / 5

First prompt included codebase context, explicit scope, constraints, and an exit criterion. Got usable output in 1–2 turns. Used plan mode before implementation. Knew when to stop using AI and handle manually. Zero vague "try again" messages — every follow-up added new information.

4 / 5

Prompts were specific and contextual. Reached a working result in 3–4 turns. Provided codebase context. One prompt was imprecise and required a clarifying follow-up but recovered quickly. Generally efficient.

3 / 5

Got to a working result but took 5–7 turns. Some prompts were vague. Provided partial codebase context. A few "try again" messages with no explanation of what was wrong. Workable but not efficient. Would accumulate technical debt in a live codebase through imprecise agent direction.

2 / 5

Pasted the problem description as the first prompt without framing. 8+ turns. Repeated the same vague request multiple times hoping for a different result. Provided no codebase context. Got there eventually but through brute-force iteration, not clear thinking.

1 / 5

Vague mega-prompts with no context. Never reached a working result independently. No scope discipline — asked the AI to "fix everything." Session transcript shows a pattern of restating the same request with minor rewording. Could not direct the agent effectively.

Dimension 3: Code Ownership (20%)

The final gate. After the session, remove the AI. Ask them to explain any line in the diff. A strong candidate can. Ownership doesn't require writing every line — it requires understanding every line.

Code Ownership

Can they defend every line in the diff? Can they rewrite any section without AI if pressed?

5 / 5

Explained every line in the diff without hesitation. When asked "why did the model choose this approach?" gave a correct, precise answer. Could rewrite any section manually if asked. Made conscious decisions to accept, reject, or modify each AI suggestion. Diff feels owned, not inherited.

4 / 5

Explained most of the diff confidently. Momentarily fuzzy on one section but recovered. Clearly read and understood the change before accepting. Made conscious decisions on most AI suggestions.

3 / 5

Can explain the overall approach but not all the details. "The model handled that specific edge case — I trust it but haven't verified it fully." Read the diff but accepted some sections without full understanding.

2 / 5

"The AI wrote that part." Multiple lines they cannot explain. Accepted diffs wholesale. Under questioning, reveals they did not read significant sections of the code they're about to merge.

1 / 5

Cannot explain any line in the diff. "I trust the AI." Would not be able to debug this code tomorrow if it breaks. Treats the agent as an authority, not a collaborator. Would ship this to production without understanding what it does.

Dimension 4: Orchestration Judgment (15%)

Can they decompose complex tasks correctly? Do they know when to fan out parallel agents versus stay sequential? Can they work with multi-agent systems without losing track of state?

Orchestration & Task Decomposition

Fan-out vs. sequential decisions, task decomposition, checkpoint behavior, multi-agent patterns

5 / 5

Broke the problem into atomic subtasks before starting. Made a deliberate fan-out decision ("these two tasks have no shared state so I'll run them in parallel"). Used checkpoints — reviewed intermediate output before proceeding. Knew when to take the wheel and when to keep the agent in the loop. Could articulate why they chose each sequencing decision.

4 / 5

Decomposed the task into meaningful subtasks. Mostly sequential but identified one valid parallel opportunity. Reviewed output at natural breakpoints. Showed awareness of task dependencies.

3 / 5

Broke the task into 2–3 parts. Executed sequentially throughout — no fan-out — but the sequencing was mostly reasonable. Checkpoint behavior was ad hoc. Got to the right result but through a less efficient path than possible.

2 / 5

Treated the agent as autocomplete, not a coordinator. Single giant prompt for a complex task. No task decomposition. Let the agent run continuously and reviewed only at the end — errors compounded. No awareness of sequential vs. parallel tradeoffs.

1 / 5

Asked the agent to "build the whole feature." No decomposition, no checkpoints. Completely reactive — prompted only when something visibly broke. Does not understand how to direct an agent system for work that has dependencies.

Scoring Calculator

Weighted Score Formula

Dimension	Your Score (1–5)	Weight	Weighted Score
Verification	____	× 0.40	= ____
Prompt Quality	____	× 0.25	= ____
Code Ownership	____	× 0.20	= ____
Orchestration	____	× 0.15	= ____
Total			= ____

4.0–5.0: Strong hire · 3.5–3.9: Conditional hire · 3.0–3.4: Requires second opinion · <3.0: No-hire · Verification <3: Hard no-hire regardless of total
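
The calculator can also be written as a short script. This is a minimal sketch, not an official tool for this rubric. The names `WEIGHTS`, `weighted_total`, and `recommendation` are illustrative assumptions; the weights, bands, and Verification gate come from the rubric itself. One further assumption: a total that falls between the published bands (for example 3.95) is placed in the lower band.

```python
# Rubric weights as integer percentages, which avoids float rounding surprises.
WEIGHTS = {
    "verification": 40,
    "prompt_quality": 25,
    "code_ownership": 20,
    "orchestration": 15,
}


def weighted_total(scores: dict[str, int]) -> float:
    """Return the weighted total (1.0 to 5.0) for one candidate."""
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected scores for exactly: {sorted(WEIGHTS)}")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {score}")
    return sum(scores[name] * weight for name, weight in WEIGHTS.items()) / 100


def recommendation(scores: dict[str, int]) -> str:
    """Map scores to a hiring band, applying the Verification gate first."""
    total = weighted_total(scores)
    if scores["verification"] < 3:
        return "hard no-hire"  # gate overrides the total
    if total >= 4.0:
        return "strong hire"
    if total >= 3.5:
        return "conditional hire"  # assumption: also covers 3.9 < total < 4.0
    if total >= 3.0:
        return "requires second opinion"
    return "no-hire"


example = {"verification": 4, "prompt_quality": 3, "code_ownership": 4, "orchestration": 3}
print(weighted_total(example), recommendation(example))  # prints: 3.6 conditional hire
```

The gate is checked before the bands on purpose: a candidate who scores 2 on Verification and 5 everywhere else totals 3.8, which would otherwise read as a conditional hire.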

Red Flags and Green Flags Quick Reference

✓ Green Flags

Opens CLAUDE.md or AGENTS.md before prompting
Uses plan mode before any implementation
Writes a failing test before accepting the AI's fix
Explicitly rejects a model suggestion with clear reasoning
Gets to working output in under 3 turns
Asks "what would break this?" unprompted
Can explain every line from memory
Scopes what's out-of-scope before prompting
Deliberately chooses NOT to use AI for a specific part
Reviews intermediate output at natural breakpoints (checkpoint pattern)

✗ Red Flags

Pastes full problem as first prompt, no scoping
Accepts diffs without reading them
"The AI handled that part" under questioning
No test written or run after AI produces output
10+ turns for a task that should take 2
Treats model confidence as fact
Can't explain the approach the model chose
Doesn't notice a fabricated import in a 50-line diff
Repeats the same vague prompt hoping for different output
Never asks the model to explain its own reasoning

Transcript Review Checklist

Review the candidate's AI session transcript after the interview. This is where the real signal is.

Claude Code / Cursor Session Review

Did they use plan mode before any implementation step?
How many turns to first usable output? (Target: ≤ 3)
Did they provide codebase context in the first prompt?
What percentage of suggestions did they reject? (0% = red flag)
When they rejected, did they explain why? (Reason required for credit)
Did they start a new conversation mid-task? (Context loss = red flag)
Did they review intermediate output at natural breakpoints?
Did they write or run a test at any point in the session?
Was there any "try again" or "that's wrong, fix it" without explanation?
What was the shape of the conversation — a few long precise prompts, or many short vague ones?
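
Several of the checklist items reduce to simple counts once a reviewer has labelled each turn. The sketch below assumes a hypothetical, hand-labelled `Turn` record; it does not parse any real Claude Code or Cursor export format, and the field names are invented for illustration. It also assumes one model suggestion per turn when computing the rejection rate.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One prompt/response pair, labelled by the reviewer (hypothetical schema)."""
    usable_output: bool          # did this turn produce usable output?
    rejected: bool = False       # did the candidate reject the suggestion?
    rejection_reason: str = ""   # the reason the candidate gave, if any
    ran_test: bool = False       # did the candidate write or run a test?


def summarize(turns: list[Turn]) -> dict[str, object]:
    """Compute checklist metrics from a labelled session."""
    # 1-based index of the first turn that produced usable output, or None.
    first_usable = next(
        (i for i, t in enumerate(turns, start=1) if t.usable_output), None
    )
    rejected = [t for t in turns if t.rejected]
    return {
        "turns_to_first_usable": first_usable,
        "meets_turn_target": first_usable is not None and first_usable <= 3,
        "rejection_rate": len(rejected) / len(turns) if turns else 0.0,
        "zero_rejections_flag": len(rejected) == 0,  # 0% rejected = red flag
        "unexplained_rejections": sum(
            1 for t in rejected if not t.rejection_reason.strip()
        ),
        "ran_any_test": any(t.ran_test for t in turns),
    }


session = [
    Turn(usable_output=False, rejected=True, rejection_reason="import does not exist"),
    Turn(usable_output=True, ran_test=True),
    Turn(usable_output=True),
    Turn(usable_output=True, rejected=True),  # rejected with no reason given
]
summary = summarize(session)
print(summary)
```

For this sample session the first usable output arrives on turn 2, the rejection rate is 0.5, and one rejection has no stated reason, so it earns no credit under the checklist. The counts support the review but do not replace reading the transcript: the shape of the conversation still needs a human judgment.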

Want Altor to run the interview for you?

We conduct live AI agent proficiency interviews on behalf of US engineering teams. You get the scored report — we do the work.

Book a discovery call, or email amanda@altorlab.xyz
