2024 / case study

AI-assisted resume evaluation for internal recruiting

Cut candidate screening time by 60% with a custom reasoning pipeline that hiring managers actually trusted.

Domain: AI integration · Workflow automation
Role: Engineering owner, end to end
Stack: OpenAI API · NestJS · TypeScript · Background workers · PostgreSQL

The problem

Recruiting was drowning. Every requisition generated dozens of resumes, and early screening was the bottleneck. Hiring managers were spending an hour at a time scanning PDFs, often pattern-matching on the wrong signals because they were tired.

The brief was small and pointed: cut screening time without breaking the trust hiring managers had in the process. A black-box "AI score" was explicitly off the table. Reviewers had to be able to see why a resume scored the way it did, and override the model whenever they disagreed.

What I built

A pipeline that runs alongside the existing applicant tracking system, not on top of it.

Deterministic scoring rubrics. Each requisition has a structured rubric: must-have qualifications, nice-to-haves, anti-signals, and weighting. The rubric lives as data and is editable by recruiters without engineering involvement.
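
As an illustration of what "rubric as data" can look like, here is a minimal sketch. The type names, field names, and the example requisition are all hypothetical; the actual schema used in the project is not shown in this write-up.

```typescript
// Hypothetical rubric shape; names are illustrative, not the project's schema.
type CriterionKind = "must_have" | "nice_to_have" | "anti_signal";

interface RubricCriterion {
  id: string;
  kind: CriterionKind;
  description: string;
  weight: number; // relative importance, assumed to be non-negative
}

interface Rubric {
  requisitionId: string;
  version: number; // assumed to be bumped whenever a recruiter edits the rubric
  criteria: RubricCriterion[];
}

// Example data a recruiter might maintain (made up for this sketch).
const exampleRubric: Rubric = {
  requisitionId: "REQ-EXAMPLE",
  version: 3,
  criteria: [
    { id: "typescript", kind: "must_have", description: "Production TypeScript experience", weight: 3 },
    { id: "postgres", kind: "nice_to_have", description: "Has operated PostgreSQL", weight: 1 },
    { id: "unexplained-gaps", kind: "anti_signal", description: "Long unexplained gaps", weight: 2 },
  ],
};

// Sanity checks an editing UI could run before saving a rubric.
function validateRubric(rubric: Rubric): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  for (const c of rubric.criteria) {
    if (seen.has(c.id)) errors.push(`duplicate criterion id: ${c.id}`);
    seen.add(c.id);
    if (!Number.isFinite(c.weight) || c.weight < 0) {
      errors.push(`invalid weight for ${c.id}`);
    }
  }
  if (!rubric.criteria.some((c) => c.kind === "must_have")) {
    errors.push("rubric has no must-have criteria");
  }
  return errors;
}
```

Because the rubric is plain data, it can be stored as a row or JSON document, versioned, and edited through a form without a deploy.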

LLM calls wrapped in structure. When a candidate enters the pipeline, the resume is parsed, the rubric is loaded, and the model is asked to produce a structured evaluation: per-criterion judgement with quoted evidence from the resume, plus a summary. The model does not produce a final score directly. Scores are computed deterministically from the per-criterion judgements and the rubric weights, so the same evaluation always produces the same number.
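
The scoring step can be sketched as a pure function over the model's judgements and the rubric weights. The verdict values, the anti-signal penalty, and the 0 to 100 scale below are assumptions made for illustration; the write-up does not specify the real formula.

```typescript
// Hypothetical types; the real judgement format is not shown in the case study.
type Verdict = "met" | "partial" | "not_met";

interface Judgement {
  criterionId: string;
  verdict: Verdict;
  evidence: string[]; // quotes from the resume
}

interface WeightedCriterion {
  id: string;
  kind: "must_have" | "nice_to_have" | "anti_signal";
  weight: number;
}

// Assumed mapping from verdict to numeric credit.
const VERDICT_VALUE: Record<Verdict, number> = { met: 1, partial: 0.5, not_met: 0 };

// Pure function: no randomness, no clock, no model call.
// The same judgements and weights always yield the same score.
function computeScore(criteria: WeightedCriterion[], judgements: Judgement[]): number {
  const byId = new Map<string, Judgement>();
  for (const j of judgements) byId.set(j.criterionId, j);

  let earned = 0;
  let possible = 0;
  for (const c of criteria) {
    const j = byId.get(c.id);
    const value = j ? VERDICT_VALUE[j.verdict] : 0; // a missing judgement earns nothing
    if (c.kind === "anti_signal") {
      earned -= c.weight * value; // a matched anti-signal subtracts points
    } else {
      earned += c.weight * value;
      possible += c.weight;
    }
  }
  if (possible === 0) return 0;
  const ratio = Math.max(0, Math.min(1, earned / possible));
  return Math.round(ratio * 100);
}

const exampleCriteria: WeightedCriterion[] = [
  { id: "typescript", kind: "must_have", weight: 3 },
  { id: "postgres", kind: "nice_to_have", weight: 1 },
  { id: "unexplained-gaps", kind: "anti_signal", weight: 2 },
];

const exampleJudgements: Judgement[] = [
  { criterionId: "typescript", verdict: "met", evidence: ["5 years building TypeScript services"] },
  { criterionId: "postgres", verdict: "not_met", evidence: [] },
  { criterionId: "unexplained-gaps", verdict: "not_met", evidence: [] },
];

const exampleScore = computeScore(exampleCriteria, exampleJudgements); // 3 of 4 points -> 75
```

Keeping the arithmetic outside the model means a reviewer who disputes a score can trace it to a specific judgement or a specific weight.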

Background processing with retries and audit. Evaluations run in a queued worker. Each run is recorded with its rubric version, model version, prompt template, raw model response, and final score. Reruns are explicit and audited.
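
A minimal sketch of the audit record, using the fields listed above. The `rerunOf` and `requestedBy` fields and the in-memory `AuditLog` class are assumptions added for illustration; in the real pipeline the records would presumably live in PostgreSQL.

```typescript
// Fields mirror the list in the text; rerunOf and requestedBy are hypothetical.
interface EvaluationRun {
  runId: number;
  candidateId: string;
  rubricVersion: number;
  modelVersion: string;
  promptTemplate: string;
  rawModelResponse: string;
  finalScore: number | null; // null when the response failed validation
  rerunOf: number | null;    // runId of the run this one supersedes, if any
  requestedBy: string;
}

// Append-only log: records are frozen on write and never updated in place.
class AuditLog {
  private readonly runs: EvaluationRun[] = [];

  record(run: Omit<EvaluationRun, "runId">): EvaluationRun {
    const stored: EvaluationRun = Object.freeze({ ...run, runId: this.runs.length + 1 });
    this.runs.push(stored);
    return stored;
  }

  historyFor(candidateId: string): EvaluationRun[] {
    return this.runs.filter((r) => r.candidateId === candidateId);
  }
}

const log = new AuditLog();

const firstRun = log.record({
  candidateId: "cand-1",
  rubricVersion: 3,
  modelVersion: "model-example",
  promptTemplate: "evaluate-v1",
  rawModelResponse: "{}",
  finalScore: null,
  rerunOf: null,
  requestedBy: "worker",
});

// An explicit rerun creates a new record that points at the old one.
const rerun = log.record({
  ...firstRun,
  rawModelResponse: '{"judgements":[],"summary":"ok"}',
  finalScore: 75,
  rerunOf: firstRun.runId,
  requestedBy: "recruiter@example.com",
});
```

With this shape, the question "why did this candidate get this score" reduces to reading `historyFor(candidateId)` in order.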

Surfaced inside the existing tool. Reviewers do not visit a new product. The evaluation appears as a panel inside the candidate page they already use, with the per-criterion evidence inline. They can agree, override, or escalate, and their decision feeds back into our quality metrics.

Decisions and tradeoffs

Structured outputs, not free text. Free-text model output is easier to prompt and impossible to ship safely. The pipeline forces the model to fill a JSON schema, validates it, and rejects malformed responses. Sometimes this means the model returns nothing usable and we treat the candidate as unscreened, which is the correct conservative behavior.
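
The validate-and-reject step could look roughly like the sketch below. It is hand-rolled to stay dependency-free; the `Evaluation` shape is hypothetical, and a schema validation library would do the same job in practice.

```typescript
// Hypothetical response shape; the project's real schema is not shown.
type Verdict = "met" | "partial" | "not_met";

interface CriterionJudgement {
  criterionId: string;
  verdict: Verdict;
  evidence: string[];
}

interface Evaluation {
  judgements: CriterionJudgement[];
  summary: string;
}

const VERDICTS: readonly string[] = ["met", "partial", "not_met"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a typed Evaluation, or null when the response is malformed.
// The caller treats null as "candidate not screened" rather than guessing.
function parseEvaluation(raw: string): Evaluation | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (!isRecord(data)) return null;

  const summary = data["summary"];
  const rawJudgements = data["judgements"];
  if (typeof summary !== "string" || !Array.isArray(rawJudgements)) return null;

  const judgements: CriterionJudgement[] = [];
  for (const item of rawJudgements) {
    if (!isRecord(item)) return null;
    const criterionId = item["criterionId"];
    const verdict = item["verdict"];
    const evidence = item["evidence"];
    if (typeof criterionId !== "string") return null;
    if (typeof verdict !== "string" || VERDICTS.indexOf(verdict) === -1) return null;
    if (!Array.isArray(evidence) || !evidence.every((e) => typeof e === "string")) return null;
    judgements.push({
      criterionId,
      verdict: verdict as Verdict,
      evidence: evidence as string[],
    });
  }
  return { judgements, summary };
}
```

Rejecting the whole response on any single bad field is deliberately strict: a partially valid evaluation would produce a score that looks trustworthy but is not.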

Deterministic final scores. A score that drifts between runs is not a score; it is noise. By computing the score from the model's structured judgements rather than asking the model for a number, we get a stable artifact that reviewers can argue with productively.

Quoted evidence over confidence numbers. Reviewers do not want "92% confident." They want "this candidate matches because their resume says X." The prompt enforces evidence quotes, and the UI shows them next to the criterion. Trust went up. Override rates went down.
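
One way to back up prompt-level enforcement is to check that each quote actually appears in the parsed resume text. The case study does not say whether the pipeline does this, so the sketch below is an assumption, and `findUnsupportedQuotes` is a hypothetical helper.

```typescript
// Collapse whitespace and case so line breaks from PDF parsing
// do not cause false mismatches.
function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// Returns the quotes that cannot be found in the resume text.
// Empty quotes count as unsupported.
function findUnsupportedQuotes(resumeText: string, quotes: string[]): string[] {
  const haystack = normalizeWhitespace(resumeText);
  return quotes.filter((quote) => {
    const needle = normalizeWhitespace(quote);
    return needle.length === 0 || haystack.indexOf(needle) === -1;
  });
}

const resumeText = "Built NestJS services\n  backed by PostgreSQL.";
const unsupported = findUnsupportedQuotes(resumeText, [
  "NestJS services backed by PostgreSQL",
  "10 years of Rust",
]);
// unsupported contains only "10 years of Rust"
```

A judgement whose evidence fails this check could be shown to the reviewer as unverified instead of being silently accepted.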

Auditability over throughput. Every evaluation is durable. We could have stripped the audit log to save storage. We did not. The first time a hiring manager asked "why did this candidate get flagged," the audit log paid for itself.

Outcome

Average screening time per candidate dropped by 60%. Override rates settled into a range that suggested reviewers trusted the model on the easy calls and engaged with it on the hard ones. The pipeline became the template for other AI-assisted workflows the team picked up next.
