Welfare Compass — Production Service

Production rebuild of our hackathon prototype. Full-stack service with persistent user profiles, rule-based eligibility matching, and a LangGraph ReAct agent for conversational policy guidance.

Timeline: Jan–Mar 2026 (ongoing)
Team: 3 members — Backend · Frontend · LLM/AI
My Role: LLM/Agent Pipeline Engineer — Prompt Engineering · Code Review Lead

The hackathon result was encouraging, but two observations made it clear that v1 had a fundamental design problem.

First, the judges: "Why make users re-enter the same information every time?" Every session started from scratch — age, income, employment status, all collected again through conversation. For a service meant to lower barriers to welfare access, requiring repetitive input was a friction we'd designed in.

Second, a discovery about the existing infrastructure: the government's Bokjiro welfare portal — the official reference system — relied on civil servants manually reviewing applications to perform matching. Even the authoritative solution didn't scale. This told us there was a real gap to fill, not just a hackathon showcase to polish.

v2 direction: persistent user profiles with auto-matching, a LangGraph ReAct agent for dynamic query handling, and a full Django + Next.js production stack — expanding from Seoul youth to all citizens.

The core shift from v1 to v2 is replacing a fixed sequential pipeline with a dynamic ReAct agent, and separating the matching and search concerns that were conflated in v1.

v1 — Hackathon vs. v2 — Production
  • Architecture — v1: Streamlit, fixed sequential pipeline · v2: Django + Next.js, LangGraph ReAct agent
  • Agent design — v1: 5 nodes, always executed in order · v2: central orchestrator, 4 tools selected dynamically per turn
  • Matching — v1: search then filter (search limits candidates) · v2: rule-based full DB scan, 342 policies, ~50ms
  • Search role — v1: used for eligibility matching · v2: separated — comparison / FAQ / exploration only
  • LLM calls per request — v1: 5 · v2: ~3 (40% reduction for simple queries)
  • User profiles — v1: none, re-collected every session · v2: persistent profiles, auto-matched at session start
  • Data — v1: ~100 manually curated policies · v2: 342 policies via Youth Policy API, daily sync
  • Eligibility logic — v1: LLM-assisted (probabilistic) · v2: deterministic rule-based, no LLM in the matching path
Key Technical Decisions
Finding

Fixed pipelines waste LLM calls on simple queries

A "hello" message in v1 still triggered the full extraction → search → response sequence — 5 LLM calls for a conversational greeting. The pipeline couldn't distinguish intent before committing resources.

Decision 1 — ReAct Agent over fixed pipeline

LangGraph orchestrator selects tools per turn

The orchestrator reads the conversation and decides which tools to invoke — skipping extraction when there's nothing to extract, skipping search when the query is a follow-up clarification.

→ 40% fewer LLM calls on casual/simple queries
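The cost difference can be sketched in miniature. This is an illustrative toy, not the production orchestrator: the function names and the "policy" keyword check are stand-ins for the model's own tool-selection reasoning.

```python
# Hypothetical sketch of per-turn tool selection (not the production code).
# A fixed pipeline always runs every stage; a ReAct-style orchestrator
# lets the model decide which tools a turn actually needs.

def fixed_pipeline(message: str) -> int:
    """v1 behavior: every stage runs — 5 LLM calls regardless of intent."""
    stages = ["classify", "extract", "rewrite", "search", "respond"]
    return len(stages)  # 5 calls, even for "hello"

def react_orchestrator(message: str) -> int:
    """v2 behavior: one orchestrator call decides which tools are needed."""
    calls = 1  # the orchestrator's own reasoning call
    if "policy" in message.lower():  # stand-in for the model's decision
        calls += 2                   # e.g. search_policies + final answer
    return calls

print(fixed_pipeline("hello"))      # 5
print(react_orchestrator("hello"))  # 1
```

A greeting resolves in a single call instead of five; a substantive query still invokes the tools it needs.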
Finding

Vector search with k=20 could miss qualifying policies

With 342 policies in the DB, a policy that qualifies but ranks semantically low (due to phrasing mismatch) would not appear in the top-20 results — silently excluded from eligibility matching.

Decision 2 — Separate matching from search

Rule-based full DB scan for eligibility · search for exploration

Eligibility matching runs a deterministic rule check across all 342 policies (~50ms, no LLM). Search is reserved separately for comparison queries, FAQ, and policy exploration — where ranking tradeoffs are acceptable.

→ Zero false negatives from search truncation in matching
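The shape of the deterministic scan can be sketched as follows. The rule fields (`min_age`, `max_income`, etc.) are illustrative, not the production schema:

```python
# Hypothetical sketch of the rule-based full scan — every policy is checked,
# so a qualifying policy can never be dropped by search ranking.
from dataclasses import dataclass

@dataclass
class Policy:
    name: str
    min_age: int
    max_age: int
    max_income: int  # e.g. a monthly income ceiling in KRW

def check_eligibility(profile: dict, policies: list[Policy]) -> list[str]:
    """Deterministic pass over all policies — no ranking, no truncation, no LLM."""
    matches = []
    for p in policies:
        if (p.min_age <= profile["age"] <= p.max_age
                and profile["income"] <= p.max_income):
            matches.append(p.name)
    return matches

policies = [
    Policy("Youth Housing Subsidy", 19, 34, 2_500_000),
    Policy("Job Seeker Allowance", 18, 65, 1_800_000),
]
print(check_eligibility({"age": 27, "income": 2_000_000}, policies))
# → ['Youth Housing Subsidy']
```

With 342 policies and simple comparisons per rule, a full scan stays in the tens of milliseconds — which is why search-based candidate limiting buys nothing here.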
Finding

LLM module tightly coupled to Django ORM

The eligibility checker needed to query the database, but importing Django ORM directly into the LLM module created framework coupling — making the tool untestable without a live database.

Decision 3 — Factory pattern for dependency injection

create_check_eligibility(fetcher) — inject the data source

The eligibility tool accepts a fetcher function at construction time. Production passes the ORM query; tests pass a stub — same tool, different data source.

→ Full integration test suite without live DB dependency
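A minimal sketch of the factory, assuming a simplified rule shape (the real tool's schema is richer):

```python
# Sketch of create_check_eligibility(fetcher) — the data source is injected
# at construction time, decoupling the tool from the Django ORM.
from typing import Callable

def create_check_eligibility(fetcher: Callable[[], list[dict]]):
    """Build the eligibility tool around an injected fetcher."""
    def check_eligibility(profile: dict) -> list[str]:
        return [p["name"] for p in fetcher()
                if p["min_age"] <= profile["age"] <= p["max_age"]]
    return check_eligibility

# Production would pass an ORM-backed fetcher; a test passes a stub:
stub_fetcher = lambda: [{"name": "Youth Grant", "min_age": 19, "max_age": 34}]
tool = create_check_eligibility(stub_fetcher)
print(tool({"age": 25}))  # → ['Youth Grant']
```

Same tool, different data source — so the integration suite can exercise the real eligibility logic without a live database.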
Decision 4 — Embedded query rewriting inside search_policies

A standalone rewrite tool relied on the orchestrator calling it before search — which it occasionally skipped. Embedding rewrite_query inside search_policies makes query optimization run unconditionally whenever search is invoked, eliminating the coordination dependency.
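Structurally, the fix looks like this (the rewriter here is a trivial normalization stand-in for the real LLM-based rewrite):

```python
# Illustrative sketch: rewrite_query is called inside search_policies,
# so query optimization cannot be skipped by the orchestrator.

def rewrite_query(query: str) -> str:
    """Stand-in for the LLM-based rewriter — here just normalization."""
    return query.strip().lower()

def search_policies(query: str, index: dict[str, str]):
    query = rewrite_query(query)  # embedded: always runs when search runs
    return index.get(query)

index = {"housing subsidy": "Youth Housing Subsidy — up to 200,000 KRW/month"}
print(search_policies("  Housing Subsidy ", index))
# → Youth Housing Subsidy — up to 200,000 KRW/month
```

The coordination dependency disappears because correctness no longer rests on the orchestrator remembering a two-step tool sequence.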

Decision 5 — Claude + Codex parallel development workflow

With one engineer handling LLM/agent design, solo development creates a bottleneck between architecture and implementation. Claude handles design, review, and architectural judgment; Codex handles implementation, testing, and PR generation from structured task specs. This parallelizes work that would otherwise be sequential — though it introduced a cross-validation gap that shaped later process improvements (see Challenges).

User Browser → Next.js Frontend → Django REST API (DRF · PostgreSQL) → LangGraph ReAct Orchestrator (GPT-4o-mini — dynamically selects tools per conversation turn)
  • extract_info — extracts the user profile from conversation context
  • search_policies — hybrid semantic + keyword search, with rewrite_query embedded
  • check_eligibility — rule-based full DB scan · no LLM · 342 policies · ~50ms
Supporting services: MCP Server — BM25 (Kiwi) + Dense (Chroma) + BGE Reranker with RRF fusion · PostgreSQL — 342 policies, daily sync from the Youth Policy API
v2 system architecture — matching and search are fully separated. check_eligibility scans all 342 policies deterministically; search_policies uses hybrid retrieval only for exploration and FAQ.
Tech Stack
Backend
Django 5 DRF PostgreSQL Next.js
AI / Agent
LangGraph ReAct GPT-4o-mini OpenAI Embeddings
Search
BM25 (Kiwi tokenizer) Chroma BGE Reranker RRF
Infra / Dev
MCP LangFuse (planned) Claude Codex Jira
Completed
  • LangGraph ReAct agent with orchestrator prompt v2.2
  • 4-tool architecture: extract_info, search_policies, check_eligibility, rewrite_query (embedded)
  • Hybrid search pipeline: BM25 + Dense + BGE reranker via MCP server
  • Rule-based eligibility matching with factory pattern — full DB scan, 342 policies
  • Integration test suite: real LLM + stub tools for contract validation
  • Backend API, DB migrations, Smart Seoul Map V5 integration
  • Frontend base structure (Next.js)
In Progress
  • LangFuse observability (scoped as separate PR)
  • Streaming responses
  • Full user auth flow integration
  • Map display in chatbot (frontend rendering)
  • Evaluation dataset + scoring loop
Planned
  • Stage 0 welfare — crisis intent handling for users facing system entry barriers (no fixed address, no ID, emergency shelter needs). Designed but not yet implemented — would use a curated knowledge base separate from the policy DB.
  • Admin dashboard + usage analytics
  • Deployment
Challenge

Document-to-code cross-validation gap

During orchestrator redesign (BRAIN4-42), Claude reviewed document consistency and Codex executed implementation tasks — but neither caught field name mismatches between design docs and actual tool output schemas. A teammate's PR review surfaced the issue.

Solution

Explicit cross-validation step in task specs

Added a mandatory validation stage to all Codex task specs: before implementation, verify field names against the live tool output schemas. The retrospective also clarified which review tasks require human judgment versus LLM review.

Challenge

Inconsistent field naming across team

interests vs needs, region vs district, income vs income_level — different team members used different names for the same concepts, causing silent mismatches between the agent's extracted profile and the eligibility rule schema.

Solution

Canonical field names in orchestrator prompt v2.2

Established a single canonical field name dictionary in the orchestrator system prompt — the authoritative reference for all tool inputs and outputs. Code review now checks against this dictionary, and the task spec template includes a field-name verification step.

Challenge

Test coverage vs. infrastructure constraints

Full integration tests — real LLM calls against real tools — are slow, expensive, and sensitive to API latency. Running them on every commit was impractical, but skipping them risked missing contract-level regressions between the orchestrator and its tools.

Solution

Two-track testing strategy

Track 1 (integration_orchestrator): real LLM + stub tools — validates orchestrator logic and tool-calling contracts without infrastructure cost. Run on every PR. Track 2 (integration_live): real LLM + real tools — kept minimal, run on targeted scenarios before major releases.
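The Track 1 idea, in miniature. In the real suite the orchestrator side is a live LLM; here it is stubbed too, purely so the sketch is self-contained and runnable — only the stub-tool/contract-assertion pattern is the point:

```python
# Sketch of a Track 1-style contract test: the tool is a recording stub,
# and assertions check the tool-calling contract, not the answer quality.

def make_stub_search(results: list[str]):
    calls = []
    def search_policies(query: str) -> list[str]:
        calls.append(query)  # record every invocation for later assertions
        return results
    search_policies.calls = calls
    return search_policies

def orchestrator_turn(message: str, search) -> str:
    # Stand-in for the orchestrator deciding to invoke search this turn.
    hits = search(message)
    return hits[0] if hits else "No matching policy found."

search = make_stub_search(["Youth Housing Subsidy"])
assert orchestrator_turn("housing help", search) == "Youth Housing Subsidy"
assert search.calls == ["housing help"]  # the tool was called exactly once
print("contract test passed")
```

Because the stubs cost nothing, Track 1 can run on every PR; Track 2 keeps the expensive real-tool scenarios for release gates.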

Welfare Compass v1 demonstrated the concept was viable — placing in the Top 20 of 181 teams, with judges noting it as something citizens could use right away. v2 builds on that validated foundation with specific design goals that address what the hackathon prototype couldn't.
