Score each dimension from 1–5 during or immediately after the interview. Multiply each score by its weight and sum for a total. A score of 4.0+ = strong hire, 3.5–3.9 = conditional hire, 3.0–3.4 = requires a second opinion, below 3.0 = no-hire. A score below 3 on Verification is a hard no-hire regardless of total — a candidate who ships hallucinated code is a production liability at any level.
Dimension 1: Verification (40%)
The single most predictive signal of AI-native engineering safety. Does the candidate verify what the model produces, or do they treat model confidence as correctness?
Wrote a failing test before accepting the AI's fix. Caught a model hallucination within 2 minutes and explained exactly why it was wrong. Checked edge cases the model missed. Treated every model output as a hypothesis, not a fact. "The model suggested X but I noticed Y would break because Z."
Ran existing tests after accepting output. Caught the obvious error in the diff. Showed healthy skepticism — questioned at least one AI decision. Missed one subtle issue but found it during the walkthrough when asked.
Read the diff before accepting. Ran tests when reminded. Didn't write new tests but verified the change didn't break existing ones. Missed one non-obvious issue. Can explain the code when asked but didn't proactively check edge cases.
Accepted diffs without reading them. Ran no tests. Only noticed the obvious error when the interviewer pointed it out. Treated model confidence as verification. "The AI said it was right, so I assumed it was."
Shipped code without reading it. No tests. Couldn't explain what the change actually does. Accepted a hallucinated import. Would have merged this to production as-is.
Dimension 2: Prompt Quality (25%)
How many turns to a usable result? Fewer is almost always better. The quality of a prompt is a direct reflection of how clearly the candidate thinks about the problem before delegating.
First prompt included codebase context, explicit scope, constraints, and an exit criterion. Got usable output in 1–2 turns. Used plan mode before implementation. Knew when to stop using AI and handle manually. Zero vague "try again" messages — every follow-up added new information.
Prompts were specific and contextual. Reached a working result in 3–4 turns. Provided codebase context. One prompt was imprecise and required a clarifying follow-up but recovered quickly. Generally efficient.
Got to a working result but took 5–7 turns. Some prompts were vague. Provided partial codebase context. A few "try again without explanation" messages. Workable but not efficient. Would accumulate technical debt in a live codebase through imprecise agent direction.
Pasted the problem description as the first prompt without framing. 8+ turns. Repeated the same vague request multiple times hoping for a different result. Provided no codebase context. Got there eventually but through brute-force iteration, not clear thinking.
Vague mega-prompts with no context. Never reached a working result independently. No scope discipline — asked the AI to "fix everything." Session transcript shows a pattern of restating the same request with minor rewording. Could not direct the agent effectively.
Dimension 3: Code Ownership (20%)
The final gate. After the session, remove the AI. Ask them to explain any line in the diff. A strong candidate can. Ownership doesn't require writing every line — it requires understanding every line.
Explained every line in the diff without hesitation. When asked "why did the model choose this approach?" gave a correct, precise answer. Could rewrite any section manually if asked. Made conscious decisions to accept, reject, or modify each AI suggestion. Diff feels owned, not inherited.
Explained most of the diff confidently. Momentarily fuzzy on one section but recovered. Clearly read and understood the change before accepting. Made conscious decisions on most AI suggestions.
Can explain the overall approach but not all the details. "The model handled that specific edge case — I trust it but haven't verified it fully." Read the diff but accepted some sections without full understanding.
"The AI wrote that part." Multiple lines they cannot explain. Accepted diffs wholesale. Under questioning, reveals they did not read significant sections of the code they're about to merge.
Cannot explain any line in the diff. "I trust the AI." Would not be able to debug this code tomorrow if it breaks. Treats the agent as an authority, not a collaborator. Would ship this to production without understanding what it does.
Dimension 4: Orchestration Judgment (15%)
Can they decompose complex tasks correctly? Do they know when to fan out parallel agents versus stay sequential? Can they work with multi-agent systems without losing track of state?
Broke the problem into atomic subtasks before starting. Made a deliberate fan-out decision ("these two tasks have no shared state so I'll run them in parallel"). Used checkpoints — reviewed intermediate output before proceeding. Knew when to take the wheel and when to keep the agent in the loop. Could articulate why they chose each sequencing decision.
Decomposed the task into meaningful subtasks. Mostly sequential but identified one valid parallel opportunity. Reviewed output at natural breakpoints. Showed awareness of task dependencies.
Broke the task into 2–3 parts. Executed sequentially throughout — no fan-out — but the sequencing was mostly reasonable. Checkpoint behavior was ad hoc. Got to the right result but through a less efficient path than possible.
Treated the agent as autocomplete, not a coordinator. Single giant prompt for a complex task. No task decomposition. Let the agent run continuously and reviewed only at the end — errors compounded. No awareness of sequential vs. parallel tradeoffs.
Asked the agent to "build the whole feature." No decomposition, no checkpoints. Completely reactive — prompted only when something visibly broke. Does not understand how to direct an agent system for work that has dependencies.
Scoring Calculator
Weighted Score Formula
| Dimension | Your Score (1–5) | Weight | Weighted Score |
|---|---|---|---|
| Verification | ____ | × 0.40 | = ____ |
| Prompt Quality | ____ | × 0.25 | = ____ |
| Code Ownership | ____ | × 0.20 | = ____ |
| Orchestration | ____ | × 0.15 | = ____ |
| Total | = ____ |
4.0–5.0: Strong hire · 3.5–3.9: Conditional hire · 3.0–3.4: Requires second opinion · <3.0: No-hire · Verification <3: Hard no-hire regardless of total
Red Flags and Green Flags Quick Reference
✓ Green Flags
- Opens CLAUDE.md or AGENTS.md before prompting
- Uses plan mode before any implementation
- Writes a failing test before accepting the AI's fix
- Explicitly rejects a model suggestion with clear reasoning
- Gets to working output in under 3 turns
- Asks "what would break this?" unprompted
- Can explain every line from memory
- Scopes what's out-of-scope before prompting
- Deliberately chose NOT to use AI for a specific part
- Uses plan mode / checkpoint pattern at natural breakpoints
✗ Red Flags
- Pastes full problem as first prompt, no scoping
- Accepts diffs without reading them
- "The AI handled that part" under questioning
- No test written or run after AI produces output
- 10+ turns for a task that should take 2
- Treats model confidence as fact
- Can't explain the approach the model chose
- Didn't notice a fabricated import in a 50-line diff
- Repeats the same vague prompt hoping for different output
- Never asked the model to explain its own reasoning
Transcript Review Checklist
Review the candidate's AI session transcript after the interview. This is where the real signal is.
- Did they use plan mode before any implementation step?
- How many turns to first usable output? (Target: ≤ 3)
- Did they provide codebase context in the first prompt?
- What percentage of suggestions did they reject? (0% = red flag)
- When they rejected, did they explain why? (Reason required for credit)
- Did they start a new conversation mid-task? (Context loss = red flag)
- Did they review intermediate output at natural breakpoints?
- Did they write or run a test at any point in the session?
- Was there any "try again" or "that's wrong, fix it" without explanation?
- What was the shape of the conversation — a few long precise prompts, or many short vague ones?
Want Altor to run the interview for you?
We conduct live AI agent proficiency interviews on behalf of US engineering teams. You get the scored report — we do the work.
See also: Complete guide to AI agent interviewing · Altor's interview service · Altor vs. Karat