Map-once, score-forever

Chris Kirilov · chriskirilov.com · OM system design notes

OM converts an O(users × jobs) LLM ranking problem into O(users + jobs) plus pure math. The cost economics that follow — $40–$250 vs $25K–$50K at the same scale — are the difference between a system that grows and one that can't.

This page is the architecture of how. It assumes you've seen the case study and want the engineering layer underneath.

01 The cost wall every brute-force LLM ranker hits

The natural way to score (user, job) fit with an LLM is one call per pair: prompt the model with the user profile and the job description, ask for a fit score. It works. It's also wrong, because the cost scales with the cross product.

OM serves a daily feed against an active corpus of 88,125 jobs. Scored naively, a single user already costs 88K LLM calls per scoring cycle. Project to the load profile the system is architected for, 1,000 users × 5,000 daily-relevant jobs each, and the brute-force approach lands at five million LLM calls per cycle. At Haiku rates with prompt caching, that's $25K to $50K per cycle. Daily.

Approach	LLM calls / cycle	Cost / cycle
Brute force (Haiku per pair)	5,000,000	$25K — $50K
Graph (map-once, math forever)	~2,000	$40 — $250

By the table's own ranges, the delta is roughly 100× to 1,250×. The shape of the gap matters more than the exact number: at brute-force economics the system can't grow, because every additional user multiplies the LLM bill against every job in the corpus. At graph economics, adding a user costs one Sonnet call once. The architecture is the cost story.
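The cost shape can be sketched as back-of-envelope arithmetic. This is an illustration, not OM's cost model: the per-pair price is simply what the table's $25K to $50K range implies at five million calls, the $0.03 figure is the per-user mapping cost quoted later on this page, and the function names are hypothetical.

```python
# Back-of-envelope cost shapes. All names here are illustrative.
USERS = 1_000
JOBS_PER_USER = 5_000  # daily-relevant jobs per user


def brute_force_calls(users: int, jobs_per_user: int) -> int:
    """One LLM call per (user, job) pair: grows with the cross product."""
    return users * jobs_per_user


def brute_force_marginal_usd(jobs_per_user: int, usd_per_pair_call: float) -> float:
    """Extra spend per cycle for one more user under pairwise scoring."""
    return jobs_per_user * usd_per_pair_call


pair_calls = brute_force_calls(USERS, JOBS_PER_USER)

# Per-pair price implied by the table: $25K to $50K spread over pair_calls.
implied_low = 25_000 / pair_calls
implied_high = 50_000 / pair_calls

# Marginal cost of one more user, per cycle, at the low implied price.
marginal_pairwise_usd = brute_force_marginal_usd(JOBS_PER_USER, implied_low)

# Marginal cost of one more user under map-once: a single mapping call,
# paid once rather than every cycle.
MARGINAL_MAP_ONCE_USD = 0.03

print(pair_calls, implied_low, implied_high, marginal_pairwise_usd)
```

Under these assumptions, the pairwise approach pays the marginal figure again every cycle, while the map-once figure is paid only when a profile is created or changes.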

The deeper point — quality follows the cost shape. Brute-force pairwise scoring is also worse, because each call has no shared vocabulary across users or jobs. Two LLM calls scoring the same job for two different users use different mental ontologies for "Python" or "RevOps experience." The graph is a normalized representation: same node space for every user, every job. That's what makes the scores comparable in the first place.

02 The graph is the data structure that makes decomposition possible

The decomposition only works if there's a shared vocabulary between users and jobs that's stable, queryable, and cheaper to traverse than to regenerate. That's the graph.

Nodes	186
Edges	337
Hierarchy	L0 cluster → L1 sub-competency
Edge types	composition, dependency, transfer
Transfer weights	0.3 – 0.8
Seed cost	~$1, paid once

The graph was seeded by running Sonnet against the top 50 GTM job descriptions in two passes — nodes first, then edges. The taxonomy that emerged is two-level: L0 clusters (broad competency areas like "data engineering" or "GTM ops"), L1 sub-competencies (specific skills like "dbt modeling" or "Salesforce admin").

Edges carry semantics, not just topology. A composition edge means a sub-skill rolls up into a parent (dbt → analytics engineering). A dependency edge means having A is required to do B (SQL → dbt). A transfer edge means A partially advances B with a learned weight (Python → JavaScript at 0.4, because some skills transfer and some don't).
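One way to represent typed edges is sketched below. The class and field names are assumptions made for illustration; only the three edge types, the example skills, and the 0.3 to 0.8 transfer range come from this page.

```python
from dataclasses import dataclass
from enum import Enum


class EdgeType(Enum):
    COMPOSITION = "composition"  # sub-skill rolls up into a parent
    DEPENDENCY = "dependency"    # source is required to do target
    TRANSFER = "transfer"        # source partially advances target


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: EdgeType
    weight: float = 1.0  # only meaningful for transfer edges

    def __post_init__(self):
        # Transfer weights on this page range from 0.3 to 0.8.
        if self.kind is EdgeType.TRANSFER and not 0.3 <= self.weight <= 0.8:
            raise ValueError(f"transfer weight out of range: {self.weight}")


EDGES = [
    Edge("dbt", "analytics engineering", EdgeType.COMPOSITION),
    Edge("SQL", "dbt", EdgeType.DEPENDENCY),
    Edge("Python", "JavaScript", EdgeType.TRANSFER, 0.4),
]


def edges_from(node: str, kind: EdgeType) -> list[Edge]:
    """All outgoing edges of one type for a node."""
    return [e for e in EDGES if e.source == node and e.kind is kind]
```

Keeping the edge type explicit is what lets later steps treat the three relationships differently, for example propagating depth only along transfer edges.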

Why a hand-curated graph instead of vector embeddings, the obvious alternative? Embeddings would handle entity expansion automatically: a new skill mentioned in a JD would land somewhere in vector space without an explicit node. That is powerful when the entity space is large or evolving. The GTM competency space is neither. 186 nodes cover the actual landscape, and every node carries semantic meaning a recruiter or operator can read. The graph is opinionated, explainable, and the scoring it produces can be audited node by node. Embeddings give up explainability for generalization OM doesn't need.

03 The mapping operations amortize

The graph only enables decomposition because users and jobs both map into it once and stay mapped. The mapping operations are where the LLM cost lives — and they're paid once per entity, not once per pair.

Mapping a user

One Sonnet call, run when a user creates their profile or updates their resume. Input: the resume, profile context, and the 186-node taxonomy. Output: a list of {node_id, depth: 0–100, confidence: 0–1, evidence} tuples. A typical user lights up 80–150 nodes.

Then propagate_skills() runs pure-math edge traversal to infer adjacencies. If a user has dbt at depth 70 and the graph has a transfer edge from dbt → Snowflake at weight 0.6, the user gets Snowflake at depth ~42 with reduced confidence. Propagation runs at map time, not score time. The inference is cached as user state.
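The propagation step can be sketched as a single pass over transfer edges. This is a minimal sketch, not the actual propagate_skills(): the data shapes are hypothetical, propagation is limited to one hop, and the confidence rule (multiply by the edge weight) is an assumption, since the page only says confidence is reduced.

```python
def propagate_skills(user_state, transfer_edges):
    """Infer adjacent skills from directly mapped ones.

    user_state: {node_id: {"depth": float, "confidence": float}}
    transfer_edges: iterable of (source, target, weight) tuples
    """
    inferred = dict(user_state)
    for source, target, weight in transfer_edges:
        # Read only from the directly mapped state, so inference is one hop.
        if source not in user_state:
            continue
        src = user_state[source]
        candidate = {
            "depth": src["depth"] * weight,
            "confidence": src["confidence"] * weight,  # assumed attenuation rule
            "inferred": True,
        }
        current = inferred.get(target)
        # Never let an inferred value overwrite a deeper existing one.
        if current is None or candidate["depth"] > current["depth"]:
            inferred[target] = candidate
    return inferred


state = {"dbt": {"depth": 70, "confidence": 0.9}}
result = propagate_skills(state, [("dbt", "snowflake", 0.6)])
print(result["snowflake"])
```

Because the result is stored as user state, the traversal cost is paid at map time and never again at score time.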

Cost per user map: $0.03 (Sonnet 4.5, ~7K input + 2.5K output tokens, prompt-cached system prefix).

Mapping a job

Four-stage hybrid: rules-based extraction, fuzzy match against the graph, dictionary lookup, LLM-on-residual. About 80% of jobs land entirely in the deterministic stages — the JD says "dbt," the rules layer matches dbt, no LLM call needed. The remaining 20% hit Haiku for the residual nodes.
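The staged fall-through can be sketched as follows. The taxonomy slice, alias dictionary, node ids, and the 0.85 fuzzy cutoff are all hypothetical, and the sketch assumes candidate terms have already been extracted from the JD. Whatever the deterministic stages cannot place is returned as the residual that would go to the LLM.

```python
import difflib

# Hypothetical slices of the taxonomy and the alias dictionary.
NODE_IDS = {"dbt": "n_dbt", "sql": "n_sql", "salesforce admin": "n_sfdc_admin"}
ALIASES = {"sfdc": "n_sfdc_admin"}


def map_job_terms(terms):
    """Return ({term: (node_id, stage)}, residual_terms_for_llm)."""
    matched, residual = {}, []
    for term in terms:
        key = term.strip().lower()
        if key in NODE_IDS:  # stage 1: rules (exact match)
            matched[term] = (NODE_IDS[key], "rules")
            continue
        close = difflib.get_close_matches(key, list(NODE_IDS), n=1, cutoff=0.85)
        if close:  # stage 2: fuzzy match against node names
            matched[term] = (NODE_IDS[close[0]], "fuzzy")
            continue
        if key in ALIASES:  # stage 3: dictionary lookup
            matched[term] = (ALIASES[key], "dictionary")
            continue
        residual.append(term)  # stage 4: left for the LLM
    return matched, residual


matched, residual = map_job_terms(["dbt", "Salesforce admn", "SFDC", "quantum basketry"])
print(matched)
print(residual)
```

A job whose residual list comes back empty never triggers an LLM call, which is the case the 80% figure describes.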

Output per job: 15–30 nodes with {node_id, required_depth, is_critical, differentiation_power}. Required depth is how senior the role needs in that skill. Critical flags whether the skill is hard-required vs nice-to-have. Differentiation power is how much that skill distinguishes the role from the median JD in the same L0 cluster.

Cost per job map: $0.001 average — most jobs are entirely deterministic. The 20% that hit Haiku run at ~3K input + 1K output tokens.

04 Scoring is pure arithmetic

At query time — every feed query, every triage call, every batch update — there are zero LLM calls. The score is computed from the mapped representations.

The compute_capability_fit() function:

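# excerpt: type_multiplier, critical_coverage and rescale() are not shown here
weighted_sum = 0.0
total_weight = 0.0

for required_node in job.requirements:
    # assumed here: a skill the user was never mapped to counts as depth 0
    state = user_state.get(required_node.id)
    user_depth = state.depth if state else 0
    # share of the required depth the user covers, capped at 1.0
    node_score = min(user_depth / required_node.required_depth, 1.0)
    weight = (
        type_multiplier *
        required_node.differentiation_power *
        idf_rarity[required_node.id]
    )
    weighted_sum += node_score * weight
    total_weight += weight

# guard against a job with no weighted requirements
raw_score = weighted_sum / total_weight if total_weight else 0.0

# coverage penalty: if <50% of critical nodes met, attenuate
if critical_coverage < 0.5:
    raw_score *= critical_coverage * 2

# domain-critical-non-transferable nodes: 3× weight, gated by confidence (not shown)
# rescale to 1–10 for the feed
final_score = rescale(raw_score, 1, 10)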

Three things to notice. First, the score is interpretable — it can be decomposed back into per-node contributions, so a user can see exactly why a job ranked where it ranked. Second, the IDF rarity term means a rare skill match is worth more than a common one (matching "Tableau" is everywhere; matching "graph database for sales ops" is signal). Third, the coverage penalty prevents a job from looking like a great fit when only one or two of its critical requirements are met.
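The interpretability claim can be illustrated with a small decomposition. This is a minimal sketch, not OM's compute_capability_fit(): explain_fit and the dict-shaped inputs are hypothetical, type_multiplier is treated as 1.0, and the coverage penalty and rescale are left out so that the per-node shares sum to the raw score.

```python
def explain_fit(requirements, user_depths, idf_rarity):
    """Return (raw_score, {node_id: share of raw_score}).

    requirements: list of dicts with node_id, required_depth,
                  differentiation_power
    user_depths:  {node_id: depth}; missing nodes count as depth 0
    """
    rows = []
    for req in requirements:
        depth = user_depths.get(req["node_id"], 0.0)
        node_score = min(depth / req["required_depth"], 1.0)
        weight = req["differentiation_power"] * idf_rarity[req["node_id"]]
        rows.append((req["node_id"], node_score, weight))

    total_weight = sum(weight for _, _, weight in rows)
    if total_weight == 0:
        return 0.0, {}

    # Each node's share is its weighted score over the total weight.
    shares = {node: score * weight / total_weight for node, score, weight in rows}
    return sum(shares.values()), shares


reqs = [
    {"node_id": "dbt", "required_depth": 50, "differentiation_power": 1.0},
    {"node_id": "sql", "required_depth": 80, "differentiation_power": 1.0},
    {"node_id": "tableau", "required_depth": 40, "differentiation_power": 0.5},
]
raw, shares = explain_fit(reqs, {"dbt": 70, "sql": 40}, {"dbt": 1.0, "sql": 1.0, "tableau": 1.0})
print(raw, shares)
```

Sorting the shares in descending order gives a per-job explanation: which nodes carried the score and which required nodes contributed nothing.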

The whole scoring pass over 88K jobs runs in under a second of CPU time. No network calls. No LLM. Just rows in SQLite intersected against rows in user state.

05 The supporting architecture, briefly

The architecture exists in service of the insight above. It's worth knowing but not the headline.

[Figure: signal flow, from sources to ranked feed via Flask]

Ingestion

Registry-driven, not crawler-driven. company_boards is the canonical source — every scrape decision is a row, not a hard-coded list. 1,667 boards across 8 ATS platforms (Greenhouse 824, Ashby 590, Lever 188, SmartRecruiters 65 in production; Workday, Teamtailor, BambooHR, iCIMS code paths exist for the next wave). Plus HN, Jobicy, Gmail alerts, and a Chrome extension for one-click paste from any job page.

Each ATS client maps the platform's native payload to a uniform dict: {url, title, company, jd_text, ats_type, ats_job_id, location, apply_url, salary_min/max}. Greenhouse and Ashby return clean JSON. Workday needs {wd, site} params extracted from the URL. iCIMS branches on portal_type. By the time it hits the upsert, the schema is platform-agnostic.
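The adapter pattern can be sketched with one client. The uniform keys come from the list above; the function names are hypothetical, and the Greenhouse payload keys used here are illustrative and should be checked against the platform's API rather than taken from this sketch.

```python
UNIFORM_KEYS = (
    "url", "title", "company", "jd_text", "ats_type", "ats_job_id",
    "location", "apply_url", "salary_min", "salary_max",
)


def to_uniform(**fields) -> dict:
    """Build the platform-agnostic record; absent fields become None."""
    unknown = set(fields) - set(UNIFORM_KEYS)
    if unknown:
        raise KeyError(f"unexpected fields: {sorted(unknown)}")
    return {key: fields.get(key) for key in UNIFORM_KEYS}


def normalize_greenhouse(payload: dict, company: str) -> dict:
    # Payload keys below are assumptions made for illustration.
    return to_uniform(
        url=payload.get("absolute_url"),
        title=payload.get("title"),
        company=company,
        jd_text=payload.get("content"),
        ats_type="greenhouse",
        ats_job_id=str(payload.get("id")),
        location=(payload.get("location") or {}).get("name"),
        apply_url=payload.get("absolute_url"),
    )


record = normalize_greenhouse(
    {
        "id": 123,
        "title": "RevOps Analyst",
        "absolute_url": "https://example.com/jobs/123",
        "content": "JD text",
        "location": {"name": "Remote"},
    },
    "Acme",
)
print(record)
```

Routing every client through one constructor means a platform quirk can only leak into the upsert as a None, never as an unexpected key.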

Operating cadence: 88,125 active jobs, 184K cumulative evaluations, 600 LLM calls per day at sub-$1 daily inference cost. ATS API polling at 6am and 6pm. Email ingestion every 30 minutes. Company-discovery sweep weekly. Throughput peaks at 1,540 jobs ingested on a heavy day.

Operator-level texture worth knowing: the retry set is the most-criticized piece. 429 isn't in it, which is a real bug: 429 is the actual rate-limit status code, and Workday drops boards silently when its quota fires. 403 currently is in it, which is also wrong: 403 means the server understood the request and refused it, so retrying just burns calls. ETag/304 conditional GETs are Greenhouse-only because that's where I implemented them; the other clients re-fetch daily until I extend them. Auth is per-route via decorators (Bearer for /api/*, @login_required for web routes), not central middleware. It will move to a Flask blueprint with auth applied at registration before another engineer touches the surface.
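A corrected retry policy might look like the sketch below. Only the treatment of 429 and 403 follows from the note above; the 5xx members of the set, the function names, and the backoff parameters are assumptions. Retry-After is a standard HTTP response header; this sketch handles only its delay-in-seconds form.

```python
import random
from typing import Optional

# 429 is retryable; 403 is deliberately absent. The 5xx entries are assumed.
RETRYABLE_STATUS = frozenset({429, 500, 502, 503, 504})


def should_retry(status_code: int) -> bool:
    return status_code in RETRYABLE_STATUS


def backoff_seconds(attempt: int, retry_after: Optional[str] = None,
                    base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Honor the server's hint, clamped to [0, cap].
            return max(0.0, min(float(retry_after), cap))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not handled in this sketch
    # Exponential backoff with full jitter.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Full jitter spreads retries from many boards across the window, so a rate-limited platform is not hit again by every client at the same instant.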

Storage

SQLite, single file, 480MB on disk. Multi-process write contention is solved by WAL mode plus the activity table's deterministic event_key dedup. 88K jobs and 184K evaluations sit well within SQLite's analytical comfort zone.

The hot indexes:

opportunities(url) UNIQUE — drives every ingest upsert
opportunities(is_active, created_at DESC) — drives feed query
evaluation_pipeline(user_id, status, adjusted_score DESC) — drives ranked feed by funnel stage
opportunity_activity(event_key) UNIQUE — drives activity dedup
slug_validation_log(slug, validated_at) — drives 30-day TTL on negative cache
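Two of these indexes can be sketched with Python's sqlite3 module. The table columns, index names, and upsert_opportunity are hypothetical; only the indexed column lists come from the list above. An in-memory database is used so the sketch runs anywhere; on a file-backed database, WAL would be enabled with PRAGMA journal_mode=WAL.

```python
import sqlite3

SCHEMA = """
CREATE TABLE opportunities (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    title TEXT,
    is_active INTEGER NOT NULL DEFAULT 1,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE UNIQUE INDEX idx_opportunities_url ON opportunities(url);
CREATE INDEX idx_opportunities_feed ON opportunities(is_active, created_at DESC);
"""


def upsert_opportunity(conn, url, title):
    """Insert a job, or refresh it if the URL is already known."""
    conn.execute(
        "INSERT INTO opportunities (url, title) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET title = excluded.title, is_active = 1",
        (url, title),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
upsert_opportunity(conn, "https://example.com/jobs/1", "RevOps Analyst")
upsert_opportunity(conn, "https://example.com/jobs/1", "Senior RevOps Analyst")
rows = conn.execute("SELECT url, title FROM opportunities").fetchall()
print(rows)
```

The unique index on url is what makes re-ingesting the same posting idempotent: the second call updates the row instead of adding one.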

No vector store, no Redis. The "vector" is user_skill_state rows joined against opportunity_skill_requirements rows — relational and indexed. Caching is in-process for IDF rarity values and Anthropic-side for system prompt prefixes.

Backup is cp. Migration is a script. The simplicity is the feature, until the load profile changes — at which point the queue worker pool becomes the dominant writer and contention forces the move to Postgres. That's a same-schema migration with the application layer already idempotent.

Observability

The strongest piece is the api_usage table — every LLM call logged with model, function type, input/output/cache tokens, computed USD cost, batch flag, success flag. get_api_usage_summary() gives 30d / 7d / daily breakdowns by function and model. Spend at any level of granularity is one SQL query away.
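A ledger of this kind can be sketched as one table and one grouped query. The column names, log_call, and spend_by_function are hypothetical stand-ins, not the actual schema or get_api_usage_summary(); the sketch only shows why spend by function and model is a single SQL query once every call is logged with its cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE api_usage (
        id INTEGER PRIMARY KEY,
        called_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        model TEXT NOT NULL,
        function_type TEXT NOT NULL,
        input_tokens INTEGER NOT NULL,
        output_tokens INTEGER NOT NULL,
        cache_tokens INTEGER NOT NULL DEFAULT 0,
        cost_usd REAL NOT NULL,
        is_batch INTEGER NOT NULL DEFAULT 0,
        success INTEGER NOT NULL DEFAULT 1
    )
""")


def log_call(conn, model, function_type, input_tokens, output_tokens, cost_usd):
    """Record one LLM call with its already-computed USD cost."""
    conn.execute(
        "INSERT INTO api_usage "
        "(model, function_type, input_tokens, output_tokens, cost_usd) "
        "VALUES (?, ?, ?, ?, ?)",
        (model, function_type, input_tokens, output_tokens, cost_usd),
    )


def spend_by_function(conn, days):
    """(function_type, model, calls, total_usd) over the trailing window."""
    return conn.execute(
        "SELECT function_type, model, COUNT(*), SUM(cost_usd) FROM api_usage "
        "WHERE called_at >= datetime('now', ?) "
        "GROUP BY function_type, model ORDER BY SUM(cost_usd) DESC",
        (f"-{int(days)} days",),
    ).fetchall()


log_call(conn, "sonnet", "user_map", 7000, 2500, 0.03)
log_call(conn, "sonnet", "user_map", 7000, 2500, 0.03)
log_call(conn, "haiku", "job_map", 3000, 1000, 0.001)
summary = spend_by_function(conn, 7)
print(summary)
```

Storing the computed cost on each row, rather than deriving it at query time, keeps historical totals stable when model prices change.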

Beyond that — module-level Python logging across 15 files. log_activity() persists user-visible events. pipeline_health() computes a 0–100 user-pipeline score (applied / interviewing / offer counts, conversion rates, stale apps over 7 days). What's missing — central log aggregation, metrics dashboards, anomaly detection, cost circuit breakers. Spend is tracked, not enforced. The /health endpoint returns hardcoded {status: ok, version}.

Where the architecture grows

Two architectural ceilings, both load-driven, both visible from where the system sits.

Concurrent writes to SQLite. The single-writer model holds while ingest plus async enrichment is the only contention surface. Once a queue worker pool fans out the ingest path under sustained load, contention exceeds what WAL handles gracefully. Migration is straightforward — Postgres with the same schema, application layer already idempotent. This is the first wall the system will hit.

LLM prompt size for user mapping. Sonnet's context is generous, but the user-mapping prompt scales linearly with graph node count. At ~500 nodes the per-call cost crosses the threshold where prompt-cache hit-rate degrades meaningfully. The mitigation is hierarchical mapping: L0 cluster pass first, then L1 within matched clusters. Same architecture, same scoring math, smaller per-call surface. Resume generation hits a parallel ceiling and crosses it the same way — Anthropic Batch API for 50% cost reduction, plus prompt-prefix caching across the user's resume context (~$0.05 → ~$0.025 per resume).

Long-tail: slug validation cache. 31,000 rows, 95.9% not_found. Linear growth is the negative-cache pattern — at order-of-magnitude growth the table becomes mostly noise. Bloom filter for the negative path bounds cache size while preserving fast rejection.
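A Bloom filter for the negative path can be sketched in a few lines. The class, its sizing parameters, and the double-hashing scheme are illustrative choices, not OM's implementation. Two trade-offs are worth stating: a Bloom filter can report a slug as seen when it was not (a false positive, which here would wrongly reject a valid slug), and it cannot expire entries, so a TTL like the 30-day one above would need something extra such as periodically rebuilt filters.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership with no false negatives."""

    def __init__(self, size_bits: int, num_hashes: int):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


not_found = BloomFilter(size_bits=10_000, num_hashes=4)
not_found.add("acme-corp")
print("acme-corp" in not_found)
```

Memory is fixed at construction, so the negative cache stops growing with the number of rejected slugs; the price is a false-positive rate that rises as the filter fills.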

The architecture earns the right to grow into these moves because the abstractions hold. Map-once means a queue-based ingest path doesn't change the scoring layer. The graph-based representation means hierarchical mapping is a refactor of one function, not the system. The cost ledger means budget enforcement is one circuit breaker away from the existing tracking.

Map-once, score-forever. The architecture is the cost story.