CatchUp: Structuring the Unstructured

An LLM pipeline that transforms unstructured study materials — PDFs, notebooks, screenshots — into structured notes with auto-linked concepts.

Timeline: Mar 2026 – present
Role: Solo Developer
Focus: LLM Pipeline Design · VLM Prompt Engineering · Evaluation Framework
Tagline: You study. CatchUp connects.
Status: In Progress · GitHub →
Tags: VLM · LLM Pipeline · Document Structuring · RAG · Prompt Engineering · Evaluation

Students accumulate materials across formats — PDF lecture slides, Jupyter notebooks, screenshots of diagrams, handwritten notes captured on a phone. The common workflow is to drop each into ChatGPT for a quick summary. The summary is useful, but it disappears into the conversation history, disconnected from every other summary generated before it.

The result: each document is understood individually, but the relationships between them are invisible. You've studied Transformer, BERT, and GPT separately, but can't trace how self-attention behaves differently across all three — because each summary exists in its own silo.

You can see every tree, but the forest is missing.

CatchUp addresses the gap between per-document summarization and cross-document understanding. The pipeline parses unstructured inputs, generates structured notes per document, extracts canonical concepts, and connects them across your entire material library — turning isolated summaries into a searchable, linked knowledge base.

The core problem: Individual document summarization ≠ conceptual understanding. CatchUp's approach: unstructured input → structured notes → concept extraction → cross-document linking → searchable knowledge graph with RAG-based Q&A.

The pipeline handles three input formats — PDF, Jupyter notebook, and image — through format-specific parsers that converge into a unified document schema. From there, a single processing path handles analysis, note generation, concept extraction, and cross-document linking.
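The convergence step can be sketched as an extension-to-parser map (the parser stubs below are illustrative placeholders, not CatchUp's actual API):

```python
from pathlib import Path

# Stub parsers — in the real pipeline these would wrap DoclingLoader,
# nbformat, and the VLM client respectively.
def parse_pdf(path):      return {"source": path, "blocks": [], "format": "pdf"}
def parse_notebook(path): return {"source": path, "blocks": [], "format": "notebook"}
def parse_image(path):    return {"source": path, "blocks": [], "format": "image"}

PARSERS = {
    ".pdf": parse_pdf,
    ".ipynb": parse_notebook,
    ".png": parse_image, ".jpg": parse_image, ".jpeg": parse_image,
}

def parse_document(path: str) -> dict:
    """Dispatch on file extension; every parser returns the same unified
    Document schema, so downstream code stays format-agnostic."""
    suffix = Path(path).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return PARSERS[suffix](path)
```

Because the parsers all emit the same shape, adding a fourth input format is a registry entry plus one parser, with no downstream changes.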

[Pipeline diagram] PDF (DoclingLoader) · Notebook (nbformat) · Image (VLM analysis) → Unified Document Schema (Document → Blocks + metadata: source, page, type) → VLM / LLM Analysis (image-type-specific prompts · text block structuring) → Note Generation (per-document structured notes · prompt versioning) → Concept Extraction & Cross-document Linking (canonical name normalization · backlink connections) → Storage (SQLite · ChromaDB) → Streamlit UI (Notes Browser · Concept Map · RAG Q&A · Semantic Search, planned)
CatchUp pipeline: three input formats converge into a unified document schema, then flow through analysis, note generation, and concept linking. UI components on the right are planned.

Schema design: Every input format — PDF pages, notebook cells, images — is normalized into a Document → Block[] structure with source metadata (file path, page number, block type). This means downstream components — note generation, concept extraction, linking — never need to know the original format.
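A minimal sketch of that unified schema (field names here are assumptions for illustration, not the exact implementation):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Block:
    block_id: str
    block_type: str             # e.g. "text", "figure", "code", "output"
    content: str
    source: str                 # originating file path
    page: Optional[int] = None  # PDF page / notebook cell index, if any

@dataclass
class Document:
    source: str
    blocks: List[Block] = field(default_factory=list)
```

Keeping source metadata on every Block is what later makes RAG answers citable down to block ID and page number.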

Study materials contain images that carry critical information — architecture diagrams, code screenshots, mathematical derivations, handwritten annotations. A text-only pipeline misses all of this. CatchUp uses vision-language models to convert visual content into structured text, with prompts tailored to each image type.

This is where my CV research background directly applies: the core challenge is the same problem I worked on in my thesis — extracting structured information from unstructured visual input — applied to a different domain.

| Type | Label | Extraction behavior | Output |
|---|---|---|---|
| Code | Code Screenshot | Extracted as a Markdown code block with language tag and inline comments preserved. | fenced `python` code block |
| Diagram | Architecture / Flow | Structural description capturing components, relationships, and data flow direction. | structured description |
| Text | Text Capture | OCR-like extraction producing cleaned, formatted text that preserves heading hierarchy. | cleaned markdown |
| Mixed | Mixed Content | Router classifies the primary type, then applies the corresponding specialized prompt. | primary type's output |
| Other | Unclassifiable | Fallback for images that don't match known types; a generic prompt captures whatever is present. | generic description |
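The routing logic can be sketched as a type-to-prompt map with a fallback (prompt texts and the dispatch shape are illustrative, not the project's actual vlm_code / vlm_diagram / vlm_text prompts):

```python
# Illustrative placeholder prompts, keyed by classified image type.
PROMPTS = {
    "code":    "Transcribe this code screenshot as a fenced code block...",
    "diagram": "Describe components, relationships, and data flow...",
    "text":    "Extract the text, preserving heading hierarchy...",
    "other":   "Describe whatever content is present in this image...",
}

def route(classification: dict) -> str:
    """Pick the specialized prompt for the classified type.
    Mixed content resolves to its primary type (the secondary type is
    kept in metadata); anything unknown falls back to the generic prompt."""
    primary = classification.get("type", "other")
    if primary == "mixed":
        primary = classification.get("primary", "other")
    return PROMPTS.get(primary, PROMPTS["other"])
```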
VLM Comparison Experiment

A systematic comparison across 12 VLMs to identify the best cost–accuracy tradeoff for each image type. The experiment spans three commercial providers — including OpenAI's latest reasoning models — plus open-source baselines run on RunPod GPUs.

| Provider | Models | Notes |
|---|---|---|
| OpenAI | GPT-4o-mini (legacy) · GPT-4.1 nano · GPT-4.1 mini · GPT-5 nano · GPT-5 mini | GPT-5 has CoT reasoning — enables reasoning vs. non-reasoning comparison |
| Google | Gemini 3.0 Flash (budget) · Gemini 3.1 Pro (mid) | 1M context window; diagram interpretation strength |
| Anthropic | Claude Haiku 4.5 (budget) · Claude Sonnet 4.6 (mid) | Best code parsing accuracy (SWE-bench) |
| Open-source | Qwen2-VL 7B INT4 (must) · LLaVA-1.6 7B (should) · PaliGemma (stretch) | RunPod GPU; Qwen2-VL achieves 94.5% on DocVQA |

The experiment evaluates eight axes — including two comparisons that no existing benchmark covers: whether CoT reasoning models (GPT-5) actually improve document layout parsing over non-reasoning models (GPT-4.1), and the quality gap between nano and mini tiers within the same generation (5–8× cost difference).

1. Cross-provider comparison: same price tier across OpenAI / Google / Anthropic.
2. Price tier comparison: cost range spans ~150× ($0.02–$3.00 per 1M input tokens); quality delta measured.
3. Commercial vs. open-source: cost / privacy / latency tradeoffs with Qwen2-VL INT4.
4. Reasoning vs. non-reasoning: GPT-5 (CoT) vs. GPT-4.1 on document layout parsing — no existing benchmark covers this.
5. Generational jump: GPT-4o-mini → GPT-4.1 → GPT-5 quality delta measurement.
6. Nano vs. mini gap: within the same generation, 5–8× cost difference vs. quality difference.
7. Router classification accuracy: image-type precision/recall including edge cases.
8. Resolution optimization: original / 1600px / 1024px / 512px × grayscale; cost and quality impact quantified.

Gap vs. existing benchmarks: DocVQA and OCRBench measure short-answer extraction. This experiment instead targets markdown structure preservation (AST-parseable code output), diagram-to-text transformation (generative, not extractive), and mixed Korean/English IT-domain material — areas no public benchmark currently covers.

Results Coming Soon

Golden set of 15–25 images per type is being curated. Results will include per-type accuracy, cost breakdown, and recommended model assignment per image category.

1. Note Generation with Prompt Versioning

Each document passes through a structured note generation prompt that outputs section headers, key concepts, and source block references. Prompts are version-controlled (v1.0 → v1.1 → ...) with quality tracked per version, enabling systematic iteration rather than ad-hoc edits.
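A minimal registry shape for versioned prompts (keys and templates are illustrative; the real history and per-version quality deltas live in VERSION_LOG.md):

```python
# Illustrative versioned-prompt registry; templates are placeholders.
PROMPT_VERSIONS = {
    "v1.0": "Summarize this document into titled sections.",
    "v1.1": "Produce section headers, key concepts, and source block references.",
}

def latest_prompt(versions: dict) -> str:
    """Select the template for the highest vX.Y key, parsed numerically
    so e.g. v1.10 correctly sorts above v1.9."""
    key = max(versions, key=lambda v: tuple(int(p) for p in v.lstrip("v").split(".")))
    return versions[key]
```

Pinning the version used for each generated note makes quality regressions attributable to a specific prompt change rather than to model drift.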

2. RAG Q&A with Source Citation

ChromaDB vector search retrieves relevant blocks, then LangChain generates answers with explicit citations — referencing the original block ID and page number. Users can trace every claim back to the source material.
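The citation-carrying answer path might look like this, with retrieval and generation stubbed out (in the real pipeline they would be ChromaDB vector search and a LangChain-driven LLM call):

```python
def answer_with_citations(question, retrieve, generate):
    """Retrieve top blocks, generate an answer over them, and attach
    explicit (block_id, page) citations so every claim is traceable.
    `retrieve` and `generate` are injected stand-ins for the vector
    store and the LLM."""
    blocks = retrieve(question)
    context = "\n\n".join(
        f"[{b['block_id']} p.{b['page']}] {b['content']}" for b in blocks
    )
    answer = generate(question, context)
    citations = [(b["block_id"], b["page"]) for b in blocks]
    return {"answer": answer, "citations": citations}
```

Embedding the block ID and page directly into the context string is one way to let the model quote its sources verbatim in the answer.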

3. Cost Optimization

Resolution resizing experiments across 4 levels (original / 1600px / 1024px / 512px) to find the minimum resolution that maintains accuracy. Grayscale conversion tested separately — important nuance: all major VLM providers bill by pixel dimensions, not color channels, so grayscale has zero cost impact but may affect quality (especially for diagrams where color carries semantic meaning). Cost per document tracked across model assignments.
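The resizing arithmetic is simple enough to sketch directly (the pixel-count ratio is only a rough proxy for cost, since exact vision-token formulas vary by provider):

```python
def resize_dims(width: int, height: int, max_side: int):
    """Downscale so the longer side equals max_side, preserving aspect
    ratio; never upscale. Since major VLM providers bill by pixel
    dimensions rather than color channels, the resulting pixel-count
    ratio approximates the cost saving (and grayscale changes nothing)."""
    scale = max_side / max(width, height)
    if scale >= 1:
        return width, height
    return round(width * scale), round(height * scale)
```

For a 3000×2000 slide photo, the 1024px setting cuts pixel count by roughly 88%, which is why the sweep focuses on finding where accuracy starts to drop.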

Evaluation Framework
Axis 1 — VLM Accuracy

Type-specific metrics

Code: NED + AST parse success rate. Diagram: LLM-as-judge 3-axis rubric (node coverage, edge direction, hierarchy). Text: ANLS + reading order consistency. Golden set: 15–25 images with edge cases.
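The code metrics can be approximated with the standard library (difflib's ratio is a similarity proxy rather than exact Levenshtein NED; a true edit-distance implementation would be a drop-in replacement):

```python
import ast
from difflib import SequenceMatcher

def ned(pred: str, gold: str) -> float:
    """Normalized edit-distance proxy: 0.0 means identical strings,
    values near 1.0 mean almost no overlap."""
    return 1 - SequenceMatcher(None, pred, gold).ratio()

def ast_parses(code: str) -> bool:
    """Structural check: does the extracted code block parse as Python?
    Catches transcription errors (dropped colons, broken indentation)
    that string similarity alone can miss."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```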

Axis 2 — Note Quality

Factual + structural

Three metrics: factual consistency (no hallucination), coverage (no missing concepts), and redundancy (no duplicate information). Cross-vendor LLM-as-judge to prevent single-vendor bias.
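One way to aggregate such a cross-vendor judge panel (the judge callables below are stubs; the actual rubric prompts are not shown):

```python
from statistics import mean

RUBRIC = ("factual_consistency", "coverage", "redundancy")

def judge_note(note: str, source: str, judges) -> dict:
    """Score a note on the three axes with judge LLMs from different
    vendors, then average per axis so no single vendor's bias dominates.
    Each judge is a callable returning {axis: score}."""
    scores = [judge(note, source) for judge in judges]
    return {axis: mean(s[axis] for s in scores) for axis in RUBRIC}
```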

Axis 3 — Before / After

Pipeline value proof

Same 10–20 questions answered from raw document vs. CatchUp-structured notes. Measures whether structuring improves downstream Q&A quality. Kruskal-Wallis test for cross-model significance.

Evaluation In Progress

Evaluation framework is designed; golden set curation and scoring implementation are underway. Statistical testing (Kruskal-Wallis non-parametric) planned for small-sample model comparisons.

Edge Case Taxonomy

Empty / blank image

File size + entropy check before VLM call. Skips VLM, tags as "empty image."
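A minimal version of that pre-call gate (thresholds are illustrative and would be tuned against real captures):

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte. Blank or near-blank captures
    produce very low entropy, so a cheap threshold can skip the VLM
    call entirely."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def looks_blank(data: bytes, min_bytes: int = 1024, min_entropy: float = 1.0) -> bool:
    # Illustrative thresholds: tiny files and flat byte distributions
    # are tagged "empty image" without spending a VLM call.
    return len(data) < min_bytes or byte_entropy(data) < min_entropy
```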

Mixed-type content

Low classification confidence triggers multi-label handling. Primary type processed; secondary type preserved in metadata.

Unclassifiable input

OTHER fallback with generic text prompt — extracts whatever is present without forced classification.

Prompt injection (typographic)

System prompt data encapsulation (### USER DATA ###) + immunity prompting + output sanitization regex.
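A sketch of the encapsulation-plus-sanitization pattern (the regex is a deliberately small illustration, not an exhaustive filter):

```python
import re

# Patterns worth stripping from VLM output before it reaches downstream
# prompts; a production sanitizer would cover far more cases.
_SUSPECT = re.compile(r"(ignore (all )?previous instructions|### ?system)", re.I)

def encapsulate(user_data: str) -> str:
    """Wrap untrusted extracted text in explicit delimiters so the model
    treats it as data rather than instructions."""
    return f"### USER DATA ###\n{user_data}\n### END USER DATA ###"

def sanitize(vlm_output: str) -> str:
    """Neutralize instruction-like phrases found in extracted text."""
    return _SUSPECT.sub("[removed]", vlm_output)
```

Typographic injection (instructions printed inside the image itself) passes straight through OCR, which is why the sanitization runs on the VLM's output, not just the user's input.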

Low-resolution capture

Resolution threshold check with confidence warning; original image co-displayed alongside degraded extraction.

Diverse capture habits

Golden set includes varied styles (partial crops, screen photos, skewed angles). Router accuracy evaluated per style.

Completed
  • PDF parser (DoclingLoader — text + figure block extraction)
  • Jupyter notebook parser (nbformat — code / markdown / output cell separation)
  • Storage layer (SQLite metadata + ChromaDB vectors + JSONL API logging)
  • VLM client wrapper — 10 models across OpenAI, Google, Anthropic; unified interface with per-call cost tracking
  • VLM prompts v1.1 — type-specific prompts (vlm_code, vlm_diagram, vlm_text) with structured JSON output + confidence fields
  • Note generation prompts v1.4 — versioned iteration history with per-version quality delta tracked in VERSION_LOG.md
  • PR review automation (Claude-based code review pipeline)
In Progress
  • Image parser — VLM-based 5-class classification (code / diagram / text / equation / other) + type-specific routing
  • Note generation pipeline — end-to-end document → markdown study note
Planned
  • RAG Q&A (ChromaDB retrieval + LangChain RetrievalChain + source citation with block ID / page number)
  • Evaluation framework (golden set 15–25 docs + 3-axis scoring + Kruskal-Wallis test)
  • VLM comparison experiment (12 models × 5 image types × 8 analysis axes)
  • Concept extraction + canonical name normalization + cross-document backlinks
  • Streamlit UI (notes browser, concept map via pyvis, semantic search, RAG Q&A)
  • Langfuse observability (stage-level latency / token / cost dashboard)
  • Deployment (Streamlit Cloud + RunPod for open-source VLM inference)
Near-term Roadmap
Week 1
  • Image parser completion
  • Note generation v1
  • Golden set curation
Week 2
  • VLM experiment execution
  • RAG Q&A pipeline
  • Evaluation scoring loop
Week 3
  • Concept linking
  • Streamlit UI
  • Cost optimization
Parsing: DoclingLoader · nbformat
VLM: GPT-4o-mini · GPT-4.1 nano · GPT-4.1 mini · GPT-5 nano · GPT-5 mini · Gemini 3.0 Flash · Gemini 3.1 Pro · Claude Haiku 4.5 · Claude Sonnet 4.6 · Qwen2-VL 7B
LLM Pipeline: LangChain · ChromaDB · SQLite
Frontend: Streamlit · pyvis
Observability: Langfuse (planned)
Infra: RunPod · Streamlit Cloud
Part of a portfolio demonstrating the pivot from CV/multimodal research to LLM pipeline engineering. VLM is the bridge — not a separate skill, but a continuation of the same core expertise (unstructured visual data → structured output) applied to a new domain.