Welfare Compass — Production Service

Production rebuild of our hackathon prototype. Full-stack service with persistent user profiles, rule-based eligibility matching, and a LangGraph ReAct agent for conversational policy guidance.

Timeline: Jan–Mar 2026 (ongoing)
Team: 3 members — Backend · Frontend · LLM/AI
My Role: LLM/Agent Pipeline Engineer — Prompt Engineering · Code Review Lead

The hackathon result was encouraging, but two observations made it clear that v1 had a fundamental design problem.

First, the judges: "Why make users re-enter the same information every time?" Every session started from scratch — age, income, employment status, all collected again through conversation. For a service meant to lower barriers to welfare access, requiring repetitive input was a friction we'd designed in.

Second, a discovery about the existing infrastructure: the government's Bokjiro welfare portal — the official reference system — relied on civil servants manually reviewing applications to perform matching. Even the authoritative solution didn't scale. This told us there was a real gap to fill, not just a hackathon showcase to polish.

v2 direction: persistent user profiles with auto-matching, a LangGraph ReAct agent for dynamic query handling, and a full Django + Next.js production stack — expanding from Seoul youth to all citizens.

The core shift from v1 to v2 is replacing a fixed sequential pipeline with a dynamic ReAct agent, and separating the matching and search concerns that were conflated in v1.

v1 — Hackathon vs. v2 — Production
  • Architecture — v1: Streamlit, fixed sequential pipeline · v2: Django + Next.js, LangGraph ReAct agent
  • Agent design — v1: 5 nodes, always executed in order · v2: central orchestrator, 4 tools selected dynamically per turn
  • Matching — v1: search then filter (search limits candidates) · v2: rule-based full DB scan, 342 policies, ~50ms
  • Search role — v1: used for eligibility matching · v2: separated — comparison / FAQ / exploration only
  • LLM calls per request — v1: 5 · v2: ~3 (40% reduction for simple queries)
  • User profiles — v1: none, re-collected every session · v2: persistent profiles, auto-matched at session start
  • Data — v1: ~100 manually curated policies · v2: 342 policies via Youth Policy API, daily sync
  • Eligibility logic — v1: LLM-assisted (probabilistic) · v2: deterministic rule-based, no LLM in the matching path
Key Technical Decisions
Finding

Fixed pipelines waste LLM calls on simple queries

A "hello" message in v1 still triggered the full extraction → search → response sequence — 5 LLM calls for a conversational greeting. The pipeline couldn't distinguish intent before committing resources.

Decision 1 — ReAct Agent over fixed pipeline

LangGraph orchestrator selects tools per turn

The orchestrator reads the conversation and decides which tools to invoke — skipping extraction when there's nothing to extract, skipping search when the query is a follow-up clarification.

→ 40% fewer LLM calls on casual/simple queries
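The cost difference can be sketched in miniature. This is an illustrative toy, not the production orchestrator: the function names and the "policy" keyword check are stand-ins for the model's own tool-selection reasoning.

```python
# Hypothetical sketch of per-turn tool selection (not the production code).
# A fixed pipeline always runs every stage; a ReAct-style orchestrator
# lets the model decide which tools a turn actually needs.

def fixed_pipeline(message: str) -> int:
    """v1 behavior: every stage runs — 5 LLM calls regardless of intent."""
    stages = ["classify", "extract", "rewrite", "search", "respond"]
    return len(stages)  # 5 calls, even for "hello"

def react_orchestrator(message: str) -> int:
    """v2 behavior: one orchestrator call decides which tools are needed."""
    calls = 1  # the orchestrator's own reasoning call
    if "policy" in message.lower():  # stand-in for the model's decision
        calls += 2                   # e.g. search_policies + final answer
    return calls

print(fixed_pipeline("hello"))      # 5
print(react_orchestrator("hello"))  # 1
```

A greeting resolves in a single call instead of five; a substantive query still invokes the tools it needs.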
Finding

Vector search with k=20 could miss qualifying policies

With 342 policies in the DB, a policy that qualifies but ranks semantically low (due to phrasing mismatch) would not appear in the top-20 results — silently excluded from eligibility matching.

Decision 2 — Separate matching from search

Rule-based full DB scan for eligibility · search for exploration

Eligibility matching runs a deterministic rule check across all 342 policies (~50ms, no LLM). Search is reserved separately for comparison queries, FAQ, and policy exploration — where ranking tradeoffs are acceptable.

→ Zero false negatives from search truncation in matching
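The shape of the deterministic scan can be sketched as follows. The rule fields (`min_age`, `max_income`, etc.) are illustrative, not the production schema:

```python
# Hypothetical sketch of the rule-based full scan — every policy is checked,
# so a qualifying policy can never be dropped by search ranking.
from dataclasses import dataclass

@dataclass
class Policy:
    name: str
    min_age: int
    max_age: int
    max_income: int  # e.g. a monthly income ceiling in KRW

def check_eligibility(profile: dict, policies: list[Policy]) -> list[str]:
    """Deterministic pass over all policies — no ranking, no truncation, no LLM."""
    matches = []
    for p in policies:
        if (p.min_age <= profile["age"] <= p.max_age
                and profile["income"] <= p.max_income):
            matches.append(p.name)
    return matches

policies = [
    Policy("Youth Housing Subsidy", 19, 34, 2_500_000),
    Policy("Job Seeker Allowance", 18, 65, 1_800_000),
]
print(check_eligibility({"age": 27, "income": 2_000_000}, policies))
# → ['Youth Housing Subsidy']
```

With 342 policies and simple comparisons per rule, a full scan stays in the tens of milliseconds — which is why search-based candidate limiting buys nothing here.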
Finding

LLM module tightly coupled to Django ORM

The eligibility checker needed to query the database, but importing Django ORM directly into the LLM module created framework coupling — making the tool untestable without a live database.

Decision 3 — Factory pattern for dependency injection

create_check_eligibility(fetcher) — inject the data source

The eligibility tool accepts a fetcher function at construction time. Production passes the ORM query; tests pass a stub — same tool, different data source.

→ Full integration test suite without live DB dependency
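A minimal sketch of the factory, assuming a simplified rule shape (the real tool's schema is richer):

```python
# Sketch of create_check_eligibility(fetcher) — the data source is injected
# at construction time, decoupling the tool from the Django ORM.
from typing import Callable

def create_check_eligibility(fetcher: Callable[[], list[dict]]):
    """Build the eligibility tool around an injected fetcher."""
    def check_eligibility(profile: dict) -> list[str]:
        return [p["name"] for p in fetcher()
                if p["min_age"] <= profile["age"] <= p["max_age"]]
    return check_eligibility

# Production would pass an ORM-backed fetcher; a test passes a stub:
stub_fetcher = lambda: [{"name": "Youth Grant", "min_age": 19, "max_age": 34}]
tool = create_check_eligibility(stub_fetcher)
print(tool({"age": 25}))  # → ['Youth Grant']
```

Same tool, different data source — so the integration suite can exercise the real eligibility logic without a live database.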
Decision 4 — Embedded query rewriting inside search_policies

A standalone rewrite tool relied on the orchestrator calling it before search — which it occasionally skipped. Embedding rewrite_query inside search_policies makes query optimization run unconditionally whenever search is invoked, eliminating the coordination dependency.
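Structurally, the fix looks like this (the rewriter here is a trivial normalization stand-in for the real LLM-based rewrite):

```python
# Illustrative sketch: rewrite_query is called inside search_policies,
# so query optimization cannot be skipped by the orchestrator.

def rewrite_query(query: str) -> str:
    """Stand-in for the LLM-based rewriter — here just normalization."""
    return query.strip().lower()

def search_policies(query: str, index: dict[str, str]):
    query = rewrite_query(query)  # embedded: always runs when search runs
    return index.get(query)

index = {"housing subsidy": "Youth Housing Subsidy — up to 200,000 KRW/month"}
print(search_policies("  Housing Subsidy ", index))
# → Youth Housing Subsidy — up to 200,000 KRW/month
```

The coordination dependency disappears because correctness no longer rests on the orchestrator remembering a two-step tool sequence.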

Decision 5 — Claude + Codex parallel development workflow

With one engineer handling LLM/agent design, solo development creates a bottleneck between architecture and implementation. Claude handles design, review, and architectural judgment; Codex handles implementation, testing, and PR generation from structured task specs. This parallelizes work that would otherwise be sequential — though it introduced a cross-validation gap that shaped later process improvements (see Challenges).

User Browser → Next.js Frontend → Django REST API (DRF · PostgreSQL) → LangGraph ReAct Orchestrator (GPT-4o-mini — dynamically selects tools per conversation turn)
  • extract_info — extracts the user profile from conversation context
  • search_policies — hybrid semantic + keyword search, with rewrite_query embedded
  • check_eligibility — rule-based full DB scan · no LLM · 342 policies · ~50ms
Supporting services: MCP Server — BM25 (Kiwi) + Dense (Chroma) + BGE Reranker with RRF fusion · PostgreSQL — 342 policies, daily sync from the Youth Policy API
v2 system architecture — matching and search are fully separated. check_eligibility scans all 342 policies deterministically; search_policies uses hybrid retrieval only for exploration and FAQ.
Tech Stack
Backend
Django 5 DRF PostgreSQL Next.js
AI / Agent
LangGraph ReAct GPT-4o-mini OpenAI Embeddings
Search
BM25 (Kiwi tokenizer) Chroma BGE Reranker RRF
Infra / Dev
MCP LangFuse (planned) Claude Codex Jira
Completed
  • LangGraph ReAct agent with orchestrator prompt v2.2
  • 4-tool architecture: extract_info, search_policies, check_eligibility, rewrite_query (embedded)
  • Hybrid search pipeline: BM25 + Dense + BGE reranker via MCP server
  • Rule-based eligibility matching with factory pattern — full DB scan, 342 policies
  • Integration test suite: real LLM + stub tools for contract validation
  • Backend API, DB migrations, Smart Seoul Map V5 integration
  • Frontend base structure (Next.js)
In Progress
  • LangFuse observability (scoped as separate PR)
  • Streaming responses
  • Full user auth flow integration
  • Map display in chatbot (frontend rendering)
  • Evaluation dataset + scoring loop
Planned
  • Stage 0 welfare — crisis intent handling for users facing system entry barriers (no fixed address, no ID, emergency shelter needs). Designed but not yet implemented — would use a curated knowledge base separate from the policy DB.
  • Admin dashboard + usage analytics
  • Deployment
Challenge

Document-to-code cross-validation gap

During orchestrator redesign (BRAIN4-42), Claude reviewed document consistency and Codex executed implementation tasks — but neither caught field name mismatches between design docs and actual tool output schemas. A teammate's PR review surfaced the issue.

Solution

Explicit cross-validation step in task specs

Added a mandatory validation stage to all Codex task specs: before implementation, verify field names against the live tool output schemas. The retrospective also clarified which review tasks require human judgment versus LLM review.

Challenge

Inconsistent field naming across team

interests vs needs, region vs district, income vs income_level — different team members used different names for the same concepts, causing silent mismatches between the agent's extracted profile and the eligibility rule schema.

Solution

Canonical field names in orchestrator prompt v2.2

Established a single canonical field name dictionary in the orchestrator system prompt — the authoritative reference for all tool inputs and outputs. Code review now checks against this dictionary, and the task spec template includes a field-name verification step.

Challenge

Test coverage vs. infrastructure constraints

Full integration tests — real LLM calls against real tools — are slow, expensive, and sensitive to API latency. Running them on every commit was impractical, but skipping them risked missing contract-level regressions between the orchestrator and its tools.

Solution

Two-track testing strategy

Track 1 (integration_orchestrator): real LLM + stub tools — validates orchestrator logic and tool-calling contracts without infrastructure cost. Run on every PR. Track 2 (integration_live): real LLM + real tools — kept minimal, run on targeted scenarios before major releases.
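The Track 1 idea, in miniature. In the real suite the orchestrator side is a live LLM; here it is stubbed too, purely so the sketch is self-contained and runnable — only the stub-tool/contract-assertion pattern is the point:

```python
# Sketch of a Track 1-style contract test: the tool is a recording stub,
# and assertions check the tool-calling contract, not the answer quality.

def make_stub_search(results: list[str]):
    calls = []
    def search_policies(query: str) -> list[str]:
        calls.append(query)  # record every invocation for later assertions
        return results
    search_policies.calls = calls
    return search_policies

def orchestrator_turn(message: str, search) -> str:
    # Stand-in for the orchestrator deciding to invoke search this turn.
    hits = search(message)
    return hits[0] if hits else "No matching policy found."

search = make_stub_search(["Youth Housing Subsidy"])
assert orchestrator_turn("housing help", search) == "Youth Housing Subsidy"
assert search.calls == ["housing help"]  # the tool was called exactly once
print("contract test passed")
```

Because the stubs cost nothing, Track 1 can run on every PR; Track 2 keeps the expensive real-tool scenarios for release gates.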

Welfare Compass v1 demonstrated the concept was viable — placing in the Top 20 of 181 teams, with judges noting it as something citizens could use right away. v2 builds on that validated foundation with specific design goals that address what the hackathon prototype couldn't.
