Classification Eval Review

What Are Evals?

Automated tests that measure classification accuracy. Like unit tests for your LLM. Every day, the system checks whether job postings are correctly classified by function (product, engineering, etc.) and location (US vs international).

The Golden Set

A curated collection of carefully-chosen hard cases. The LLM is tested against these daily to ensure it hasn't regressed. If golden set accuracy drops below 95%, the eval aborts and no auto-fixes are applied. Think of it as the "canary in the coal mine" for classification quality.

Confidence Tiers

When the LLM flags a classification issue, it assigns a confidence score. That score determines which tier the issue falls into:

Tier A

95%+

High confidence. Auto-fixed unless safety gate is active. Usually obvious misclassifications.

Tier B

70-95%

Needs human review. This is where your judgment matters most. The bulk of review queue items.

Tier C

<70%

Low confidence. Held for review but likely ambiguous. The LLM isn't sure, so neither should we be.

The Pipeline

Each eval run follows this sequence:

Golden Set Check

→

Location Eval

→

Function Eval

→

Non-English Detection

→

Dept Conflicts

→

Tier A Auto-Fix

→

Safety Gate

→

Report

Safety Gates

Multiple layers prevent wrong fixes from reaching production. If the golden set fails, the entire eval aborts. If too many Tier A auto-fixes accumulate without human review, the safety gate blocks further fixes until a human approves the backlog. Every approve or reject you do here feeds directly into the golden set to improve future accuracy.

LLM Fallback Chain

Classification calls cascade through multiple providers for resilience:

Tier 0 Azure OpenAI gpt-4o-mini (primary, $1K startup credits)

Tier 1 Claude Agent SDK (Max subscription, zero per-token)

Tier 2 AWS Bedrock

Tier 2.5 Azure Foundry

Tier 3 Anthropic API (spend-guarded $10/day)

How Your Feedback Improves the System

Every time you approve or reject an issue, the decision is recorded in the golden set. Approving means the suggested fix was correct. Rejecting means the current value was right and the LLM was wrong. Both actions prevent the same job from being re-flagged in future eval runs. Over time, the golden set grows organically from your reviews, making the system smarter.