CatchUp: Structuring the Unstructured

An LLM pipeline that transforms unstructured study materials — PDFs, notebooks, screenshots — into structured notes with auto-linked concepts.

Timeline: Mar 2026 – present
Role: Solo Developer
Focus: LLM Pipeline Design · VLM Prompt Engineering · Evaluation Framework
Tagline: You study. CatchUp connects.
Status: In Progress · GitHub →
Tags: VLM · LLM Pipeline · Document Structuring · RAG · Prompt Engineering · Evaluation

Students accumulate materials across formats — PDF lecture slides, Jupyter notebooks, screenshots of diagrams, handwritten notes captured on a phone. The common workflow is to drop each into ChatGPT for a quick summary. The summary is useful, but it disappears into the conversation history, disconnected from every other summary generated before it.

The result: each document is understood individually, but the relationships between them are invisible. You've studied Transformer, BERT, and GPT separately, but can't trace how self-attention behaves differently across all three — because each summary exists in its own silo.

You can see every tree, but the forest is missing.

CatchUp addresses the gap between per-document summarization and cross-document understanding. The pipeline parses unstructured inputs, generates structured notes per document, extracts canonical concepts, and connects them across your entire material library — turning isolated summaries into a searchable, linked knowledge base.

The core problem: Individual document summarization ≠ conceptual understanding. CatchUp's approach: unstructured input → structured notes → concept extraction → cross-document linking → searchable knowledge graph with RAG-based Q&A.

The pipeline handles three input formats — PDF, Jupyter notebook, and image — through format-specific parsers that converge into a unified document schema. From there, a single processing path handles analysis, note generation, concept extraction, and cross-document linking.
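The convergence step can be sketched as an extension-to-parser map (the parser stubs below are illustrative placeholders, not CatchUp's actual API):

```python
from pathlib import Path

# Stub parsers — in the real pipeline these would wrap DoclingLoader,
# nbformat, and the VLM client respectively.
def parse_pdf(path):      return {"source": path, "blocks": [], "format": "pdf"}
def parse_notebook(path): return {"source": path, "blocks": [], "format": "notebook"}
def parse_image(path):    return {"source": path, "blocks": [], "format": "image"}

PARSERS = {
    ".pdf": parse_pdf,
    ".ipynb": parse_notebook,
    ".png": parse_image, ".jpg": parse_image, ".jpeg": parse_image,
}

def parse_document(path: str) -> dict:
    """Dispatch on file extension; every parser returns the same unified
    Document schema, so downstream code stays format-agnostic."""
    suffix = Path(path).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return PARSERS[suffix](path)
```

Because the parsers all emit the same shape, adding a fourth input format is a registry entry plus one parser, with no downstream changes.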

[Pipeline diagram] PDF (DoclingLoader) · Notebook (nbformat) · Image (VLM analysis) → Unified Document Schema (Document → Blocks + metadata: source, page, type) → VLM / LLM Analysis (image-type-specific prompts · text block structuring) → Note Generation (per-document structured notes · prompt versioning) → Concept Extraction & Cross-document Linking (canonical name normalization · backlink connections) → Storage (SQLite · ChromaDB) → Streamlit UI (Notes Browser · Concept Map · RAG Q&A · Semantic Search, planned)
CatchUp pipeline: three input formats converge into a unified document schema, then flow through analysis, note generation, and concept linking. UI components on the right are planned.

Schema design: Every input format — PDF pages, notebook cells, images — is normalized into a Document → Block[] structure with source metadata (file path, page number, block type). This means downstream components — note generation, concept extraction, linking — never need to know the original format.
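A minimal sketch of that unified schema (field names here are assumptions for illustration, not the exact implementation):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Block:
    block_id: str
    block_type: str             # e.g. "text", "figure", "code", "output"
    content: str
    source: str                 # originating file path
    page: Optional[int] = None  # PDF page / notebook cell index, if any

@dataclass
class Document:
    source: str
    blocks: List[Block] = field(default_factory=list)
```

Keeping source metadata on every Block is what later makes RAG answers citable down to block ID and page number.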

Study materials contain images that carry critical information — architecture diagrams, code screenshots, mathematical derivations, handwritten annotations. A text-only pipeline misses all of this. CatchUp uses vision-language models to convert visual content into structured text, with prompts tailored to each image type.

This is where my CV research background directly applies: the core challenge is the same problem I worked on in my thesis — extracting structured information from unstructured visual input — applied to a different domain.

| Type | Label | Extraction behavior | Output |
|---|---|---|---|
| Code | Code Screenshot | Extracted as a Markdown code block with language tag and inline comments preserved. | fenced `python` code block |
| Diagram | Architecture / Flow | Structural description capturing components, relationships, and data flow direction. | structured description |
| Text | Text Capture | OCR-like extraction producing cleaned, formatted text that preserves heading hierarchy. | cleaned markdown |
| Mixed | Mixed Content | Router classifies the primary type, then applies the corresponding specialized prompt. | primary type's output |
| Other | Unclassifiable | Fallback for images that don't match known types; a generic prompt captures whatever is present. | generic description |
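The routing logic can be sketched as a type-to-prompt map with a fallback (prompt texts and the dispatch shape are illustrative, not the project's actual vlm_code / vlm_diagram / vlm_text prompts):

```python
# Illustrative placeholder prompts, keyed by classified image type.
PROMPTS = {
    "code":    "Transcribe this code screenshot as a fenced code block...",
    "diagram": "Describe components, relationships, and data flow...",
    "text":    "Extract the text, preserving heading hierarchy...",
    "other":   "Describe whatever content is present in this image...",
}

def route(classification: dict) -> str:
    """Pick the specialized prompt for the classified type.
    Mixed content resolves to its primary type (the secondary type is
    kept in metadata); anything unknown falls back to the generic prompt."""
    primary = classification.get("type", "other")
    if primary == "mixed":
        primary = classification.get("primary", "other")
    return PROMPTS.get(primary, PROMPTS["other"])
```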
VLM Comparison Experiment

A systematic comparison across 12 VLMs to identify the best cost–accuracy tradeoff for each image type. The experiment spans three commercial providers — including OpenAI's latest reasoning models — plus open-source baselines run on RunPod GPUs.

| Provider | Models | Notes |
|---|---|---|
| OpenAI | GPT-4o-mini (legacy) · GPT-4.1 nano · GPT-4.1 mini · GPT-5 nano · GPT-5 mini | GPT-5 has CoT reasoning — enables reasoning vs. non-reasoning comparison |
| Google | Gemini 3.0 Flash (budget) · Gemini 3.1 Pro (mid) | 1M context window; diagram interpretation strength |
| Anthropic | Claude Haiku 4.5 (budget) · Claude Sonnet 4.6 (mid) | Best code parsing accuracy (SWE-bench) |
| Open-source | Qwen2-VL 7B INT4 (must) · LLaVA-1.6 7B (should) · PaliGemma (stretch) | RunPod GPU; Qwen2-VL achieves 94.5% on DocVQA |

The experiment evaluates eight axes — including two comparisons that no existing benchmark covers: whether CoT reasoning models (GPT-5) actually improve document layout parsing over non-reasoning models (GPT-4.1), and the quality gap between nano and mini tiers within the same generation (5–8× cost difference).

1. Cross-provider comparison: same price tier across OpenAI / Google / Anthropic.
2. Price tier comparison: cost range spans ~150× ($0.02–$3.00 per 1M input tokens); quality delta measured.
3. Commercial vs. open-source: cost / privacy / latency tradeoffs with Qwen2-VL INT4.
4. Reasoning vs. non-reasoning: GPT-5 (CoT) vs. GPT-4.1 on document layout parsing — no existing benchmark covers this.
5. Generational jump: GPT-4o-mini → GPT-4.1 → GPT-5 quality delta measurement.
6. Nano vs. mini gap: within the same generation, 5–8× cost difference vs. quality difference.
7. Router classification accuracy: image-type precision/recall including edge cases.
8. Resolution optimization: original / 1600px / 1024px / 512px × grayscale; cost and quality impact quantified.

Gap vs. existing benchmarks: DocVQA and OCRBench measure short-answer extraction. This experiment instead targets markdown structure preservation (AST-parseable code output), diagram-to-text transformation (generative, not extractive), and mixed Korean/English IT-domain material — areas no public benchmark currently covers.

Results Coming Soon

Golden set of 15–25 images per type is being curated. Results will include per-type accuracy, cost breakdown, and recommended model assignment per image category.

1. Note Generation with Prompt Versioning

Each document passes through a structured note generation prompt that outputs section headers, key concepts, and source block references. Prompts are version-controlled (v1.0 → v1.1 → ...) with quality tracked per version, enabling systematic iteration rather than ad-hoc edits.
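A minimal registry shape for versioned prompts (keys and templates are illustrative; the real history and per-version quality deltas live in VERSION_LOG.md):

```python
# Illustrative versioned-prompt registry; templates are placeholders.
PROMPT_VERSIONS = {
    "v1.0": "Summarize this document into titled sections.",
    "v1.1": "Produce section headers, key concepts, and source block references.",
}

def latest_prompt(versions: dict) -> str:
    """Select the template for the highest vX.Y key, parsed numerically
    so e.g. v1.10 correctly sorts above v1.9."""
    key = max(versions, key=lambda v: tuple(int(p) for p in v.lstrip("v").split(".")))
    return versions[key]
```

Pinning the version used for each generated note makes quality regressions attributable to a specific prompt change rather than to model drift.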

2. RAG Q&A with Source Citation

ChromaDB vector search retrieves relevant blocks, then LangChain generates answers with explicit citations — referencing the original block ID and page number. Users can trace every claim back to the source material.
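The citation-carrying answer path might look like this, with retrieval and generation stubbed out (in the real pipeline they would be ChromaDB vector search and a LangChain-driven LLM call):

```python
def answer_with_citations(question, retrieve, generate):
    """Retrieve top blocks, generate an answer over them, and attach
    explicit (block_id, page) citations so every claim is traceable.
    `retrieve` and `generate` are injected stand-ins for the vector
    store and the LLM."""
    blocks = retrieve(question)
    context = "\n\n".join(
        f"[{b['block_id']} p.{b['page']}] {b['content']}" for b in blocks
    )
    answer = generate(question, context)
    citations = [(b["block_id"], b["page"]) for b in blocks]
    return {"answer": answer, "citations": citations}
```

Embedding the block ID and page directly into the context string is one way to let the model quote its sources verbatim in the answer.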

3. Cost Optimization

Resolution resizing experiments across 4 levels (original / 1600px / 1024px / 512px) to find the minimum resolution that maintains accuracy. Grayscale conversion tested separately — important nuance: all major VLM providers bill by pixel dimensions, not color channels, so grayscale has zero cost impact but may affect quality (especially for diagrams where color carries semantic meaning). Cost per document tracked across model assignments.
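The resizing arithmetic is simple enough to sketch directly (the pixel-count ratio is only a rough proxy for cost, since exact vision-token formulas vary by provider):

```python
def resize_dims(width: int, height: int, max_side: int):
    """Downscale so the longer side equals max_side, preserving aspect
    ratio; never upscale. Since major VLM providers bill by pixel
    dimensions rather than color channels, the resulting pixel-count
    ratio approximates the cost saving (and grayscale changes nothing)."""
    scale = max_side / max(width, height)
    if scale >= 1:
        return width, height
    return round(width * scale), round(height * scale)
```

For a 3000×2000 slide photo, the 1024px setting cuts pixel count by roughly 88%, which is why the sweep focuses on finding where accuracy starts to drop.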

Evaluation Framework
Axis 1 — VLM Accuracy

Type-specific metrics

Code: NED + AST parse success rate. Diagram: LLM-as-judge 3-axis rubric (node coverage, edge direction, hierarchy). Text: ANLS + reading order consistency. Golden set: 15–25 images with edge cases.
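The code metrics can be approximated with the standard library (difflib's ratio is a similarity proxy rather than exact Levenshtein NED; a true edit-distance implementation would be a drop-in replacement):

```python
import ast
from difflib import SequenceMatcher

def ned(pred: str, gold: str) -> float:
    """Normalized edit-distance proxy: 0.0 means identical strings,
    values near 1.0 mean almost no overlap."""
    return 1 - SequenceMatcher(None, pred, gold).ratio()

def ast_parses(code: str) -> bool:
    """Structural check: does the extracted code block parse as Python?
    Catches transcription errors (dropped colons, broken indentation)
    that string similarity alone can miss."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```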

Axis 2 — Note Quality

Factual + structural

Three metrics: factual consistency (no hallucination), coverage (no missing concepts), and redundancy (no duplicate information). Cross-vendor LLM-as-judge to prevent single-vendor bias.
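One way to aggregate such a cross-vendor judge panel (the judge callables below are stubs; the actual rubric prompts are not shown):

```python
from statistics import mean

RUBRIC = ("factual_consistency", "coverage", "redundancy")

def judge_note(note: str, source: str, judges) -> dict:
    """Score a note on the three axes with judge LLMs from different
    vendors, then average per axis so no single vendor's bias dominates.
    Each judge is a callable returning {axis: score}."""
    scores = [judge(note, source) for judge in judges]
    return {axis: mean(s[axis] for s in scores) for axis in RUBRIC}
```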

Axis 3 — Before / After

Pipeline value proof

Same 10–20 questions answered from raw document vs. CatchUp-structured notes. Measures whether structuring improves downstream Q&A quality. Kruskal-Wallis test for cross-model significance.

Evaluation In Progress

Evaluation framework is designed; golden set curation and scoring implementation are underway. Statistical testing (Kruskal-Wallis non-parametric) planned for small-sample model comparisons.

Edge Case Taxonomy

Empty / blank image

File size + entropy check before VLM call. Skips VLM, tags as "empty image."
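A minimal version of that pre-call gate (thresholds are illustrative and would be tuned against real captures):

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte. Blank or near-blank captures
    produce very low entropy, so a cheap threshold can skip the VLM
    call entirely."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def looks_blank(data: bytes, min_bytes: int = 1024, min_entropy: float = 1.0) -> bool:
    # Illustrative thresholds: tiny files and flat byte distributions
    # are tagged "empty image" without spending a VLM call.
    return len(data) < min_bytes or byte_entropy(data) < min_entropy
```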

Mixed-type content

Low classification confidence triggers multi-label handling. Primary type processed; secondary type preserved in metadata.

Unclassifiable input

OTHER fallback with generic text prompt — extracts whatever is present without forced classification.

Prompt injection (typographic)

System prompt data encapsulation (### USER DATA ###) + immunity prompting + output sanitization regex.
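A sketch of the encapsulation-plus-sanitization pattern (the regex is a deliberately small illustration, not an exhaustive filter):

```python
import re

# Patterns worth stripping from VLM output before it reaches downstream
# prompts; a production sanitizer would cover far more cases.
_SUSPECT = re.compile(r"(ignore (all )?previous instructions|### ?system)", re.I)

def encapsulate(user_data: str) -> str:
    """Wrap untrusted extracted text in explicit delimiters so the model
    treats it as data rather than instructions."""
    return f"### USER DATA ###\n{user_data}\n### END USER DATA ###"

def sanitize(vlm_output: str) -> str:
    """Neutralize instruction-like phrases found in extracted text."""
    return _SUSPECT.sub("[removed]", vlm_output)
```

Typographic injection (instructions printed inside the image itself) passes straight through OCR, which is why the sanitization runs on the VLM's output, not just the user's input.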

Low-resolution capture

Resolution threshold check with confidence warning; original image co-displayed alongside degraded extraction.

Diverse capture habits

Golden set includes varied styles (partial crops, screen photos, skewed angles). Router accuracy evaluated per style.

Completed
  • PDF parser (DoclingLoader — text + figure block extraction)
  • Jupyter notebook parser (nbformat — code / markdown / output cell separation)
  • Storage layer (SQLite metadata + ChromaDB vectors + JSONL API logging)
  • VLM client wrapper — 10 models across OpenAI, Google, Anthropic; unified interface with per-call cost tracking
  • VLM prompts v1.1 — type-specific prompts (vlm_code, vlm_diagram, vlm_text) with structured JSON output + confidence fields
  • Note generation prompts v1.4 — versioned iteration history with per-version quality delta tracked in VERSION_LOG.md
  • PR review automation (Claude-based code review pipeline)
In Progress
  • Image parser — VLM-based 5-class classification (code / diagram / text / equation / other) + type-specific routing
  • Note generation pipeline — end-to-end document → markdown study note
Planned
  • RAG Q&A (ChromaDB retrieval + LangChain RetrievalChain + source citation with block ID / page number)
  • Evaluation framework (golden set 15–25 docs + 3-axis scoring + Kruskal-Wallis test)
  • VLM comparison experiment (12 models × 5 image types × 8 analysis axes)
  • Concept extraction + canonical name normalization + cross-document backlinks
  • Streamlit UI (notes browser, concept map via pyvis, semantic search, RAG Q&A)
  • Langfuse observability (stage-level latency / token / cost dashboard)
  • Deployment (Streamlit Cloud + RunPod for open-source VLM inference)
Near-term Roadmap
Week 1
  • Image parser completion
  • Note generation v1
  • Golden set curation
Week 2
  • VLM experiment execution
  • RAG Q&A pipeline
  • Evaluation scoring loop
Week 3
  • Concept linking
  • Streamlit UI
  • Cost optimization
Parsing: DoclingLoader · nbformat
VLM: GPT-4o-mini · GPT-4.1 nano · GPT-4.1 mini · GPT-5 nano · GPT-5 mini · Gemini 3.0 Flash · Gemini 3.1 Pro · Claude Haiku 4.5 · Claude Sonnet 4.6 · Qwen2-VL 7B
LLM Pipeline: LangChain · ChromaDB · SQLite
Frontend: Streamlit · pyvis
Observability: Langfuse (planned)
Infra: RunPod · Streamlit Cloud
Part of a portfolio demonstrating the pivot from CV/multimodal research to LLM pipeline engineering. VLM is the bridge — not a separate skill, but a continuation of the same core expertise (unstructured visual data → structured output) applied to a new domain.