Prompt Engineering a 'Simple' Summarization Pipeline
TL;DR: I built a Chrome extension that generates a clickable table of contents from AI chat conversations. Summarization sounds trivial — it's the "hello world" of LLM applications. In practice, making it reliable across conversations ranging from 1K to 100K+ tokens required real prompt engineering: a task-based eval framework, structured input compression that cut input tokens by 63%, dynamic length budgets, and a prompt sandwich technique for long-context instruction following. This post covers the techniques that moved the needle and the eval-driven workflow that made each change trustworthy.
- The Task: Harder Than It Looks
- Eval First: Building the North Star
- Technique 1: Structured Input Compression
- Technique 2: Dynamic Budget Injection
- Technique 3: The Prompt Sandwich
- Technique 4: Anchoring Semantics — Think UX, Not Source
- The Results
- The Prompt Evolution At A Glance
- Takeaways
The Task: Harder Than It Looks
Chat Navigator is a Chrome extension I'm building. It reads your AI chat conversations — ChatGPT, Claude, Gemini, DeepSeek — and generates a clickable table of contents. An outline you can use to jump back to specific topics in long threads.
The core pipeline is simple: messages in, structured TOC out. A single LLM call.
But the quality bar is specific:
- Coverage: capture important conversation pivots, not just early topics
- Anchor accuracy: each TOC entry has a `refId` pointing to a specific message — clicking it should jump to the right place
- Label quality: topic labels need to be concise, scannable, and recognizable
- Length control: a 5-message chat shouldn't produce 12 topics, and a 200-message chat shouldn't produce 3
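To make the target concrete, here is a minimal sketch of the output shape. The field names are illustrative assumptions, not the extension's actual schema:

```typescript
// Illustrative TOC node shape. Field names are assumptions for this post,
// not the extension's real schema.
interface TocNode {
  label: string;       // concise, scannable topic label
  refId: number;       // message id this entry jumps to when clicked
  subtopics?: TocNode[];
}

// A tiny example outline for a short conversation.
const toc: TocNode[] = [
  {
    label: "Authentication Setup",
    refId: 3,
    subtopics: [{ label: "JWT Token Strategy", refId: 5 }],
  },
  { label: "Deployment Options", refId: 12 },
];
```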
"Summarize this conversation" is the hello world of LLM applications. "Generate a hierarchical outline with accurate jump anchors, dynamic length, and consistent quality across 1K–100K token inputs" is a different task entirely.
I spent about three days (Dec 30, 2025 – Jan 1, 2026) doing concentrated prompt engineering on this pipeline, running over 20 eval runs across 8 pipeline versions. This post distills what I learned into the techniques that actually mattered.
Eval First: Building the North Star
The biggest unlock wasn't a prompt technique. It was building evals before I started iterating on prompts.
This sounds obvious in retrospect, but I've seen many prompt engineering efforts (including my own earlier work) where the loop is: change the prompt, eyeball a few outputs, decide if it "feels better." That works for about two iterations before you lose track of whether you're actually improving.
Task-Based Evaluation, Not String Similarity
I didn't have gold-standard TOCs to compare against. And even if I did, string similarity against a reference outline would be a poor proxy — there are many valid ways to outline the same conversation.
Instead, I designed navigation tasks. Each conversation in my eval dataset (26 conversations covering a wide difficulty range) has a set of nav_tasks: things a user might want to find when revisiting the conversation. Each task has:
- A description (e.g., "find where they discussed the tradeoff between latency and accuracy")
- A `target_refId` — the message where this topic lives
- An `importance` weight (1, 2, or 3)
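A task record might look like this in code. This is a sketch of my understanding; the actual eval dataset format may differ:

```typescript
// Sketch of an eval-task record. Field names follow the post
// (target_refId, importance); the real dataset format may differ.
interface NavTask {
  id: string;
  description: string;    // what a user might want to find when revisiting
  target_refId: number;   // message where the topic lives
  importance: 1 | 2 | 3;  // weight in the final score
}

const tasks: NavTask[] = [
  {
    id: "T1",
    description: "find the latency vs. accuracy tradeoff discussion",
    target_refId: 23,
    importance: 3,
  },
  {
    id: "T2",
    description: "find the Redis caching decision",
    target_refId: 47,
    importance: 1,
  },
];
```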
The LLM judge evaluates: can a user find this task in the TOC? If they can, does the anchor jump to the right place?
This maps directly to what matters for the product. I don't care if the outline uses different words than some reference — I care whether a user can navigate with it.
The Judge Prompt
The judge scores along six dimensions, weighted to reflect product priorities:
- `task_coverage` (45%) dominates because it IS the product: can users find what they're looking for?
- `anchor_accuracy` (20%) is next because jumping to the wrong message is a broken experience.
- `label_quality` and `structure_quality` (15% each) matter for scannability.
- `conciseness` and `faithfulness` are small guardrails.
The judge uses a matching protocol with explicit tolerance bands for anchoring:
| Grade | Distance | Description |
|---|---|---|
| exact | refId == target | Perfect anchor |
| near | \|refId - target\| <= 2 | Close enough |
| acceptable | \|refId - target\| <= 5 | Usable but imprecise |
| far | \|refId - target\| > 5 | Treat as weak hit or miss |
Each task is scored as hit (1.0), weak_hit (0.5), or miss (0.0), weighted by importance.
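The aggregation can be sketched as a small function. The outcome values come from the post; the exact normalization is my assumption:

```typescript
type Outcome = "hit" | "weak_hit" | "miss";

// Outcome values from the post: hit = 1.0, weak_hit = 0.5, miss = 0.0.
const OUTCOME_SCORE: Record<Outcome, number> = { hit: 1.0, weak_hit: 0.5, miss: 0.0 };

// Importance-weighted task coverage: sum(score * importance) / sum(importance).
// The normalization is an assumption; the post only specifies the per-task scores.
function taskCoverage(results: { outcome: Outcome; importance: number }[]): number {
  const totalWeight = results.reduce((s, r) => s + r.importance, 0);
  if (totalWeight === 0) return 0;
  const earned = results.reduce(
    (s, r) => s + OUTCOME_SCORE[r.outcome] * r.importance,
    0
  );
  return earned / totalWeight;
}
```

With this scheme, missing a high-importance task hurts the score far more than missing a minor one, which is exactly the behavior a navigation product wants.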
One design choice I'm particularly happy with: the judge outputs structured task_results with per-task reasoning. Here's a condensed excerpt of the output schema:
```json
{
  "task_results": [
    {
      "task_id": "T1",
      "importance": 3,
      "outcome": "hit",
      "matched_path": "Authentication Setup > JWT Token Strategy",
      "anchor_grade": "near",
      "label_grade": "good",
      "notes": "Label is specific and refId lands 1 message from target"
    }
  ],
  "top_fixes": [
    "Add a subtopic for the Redis caching decision at message #47",
    "Split 'Backend Architecture' — it covers 3 unrelated topics"
  ]
}
```
When a score drops, I can look at which specific tasks went from hit to miss, and why. This turned vague "the score went down" into actionable "task T7 regressed because the model collapsed two topics into one generic bucket." The top_fixes field gives me concrete next steps.
Program-Based Guardrails
Alongside the LLM judge, I tracked three deterministic metrics:
- `length_top_topics`: how close the topic count is to the target
- `length_total_nodes`: how close total outline nodes are to the target
- `compression_ratio`: input tokens / output label tokens
These catch failure modes the judge might miss: "good coverage but the outline is way too long" or "decent labels but half the conversation is missing." They're cheap to compute and don't burn API credits.
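The post doesn't spell out the exact formulas, but a plausible sketch of these guardrails is simple. The linear-decay closeness metric here is my assumption:

```typescript
// Closeness of an actual count to a target, mapped to [0, 1].
// Linear decay in relative error is an assumption; the real metric may differ.
function lengthScore(actual: number, target: number): number {
  if (target <= 0) return 0;
  const relError = Math.abs(actual - target) / target;
  return Math.max(0, 1 - relError);
}

// compression_ratio: input tokens per output label token.
// Higher means the outline is a denser summary of the input.
function compressionRatio(inputTokens: number, outputLabelTokens: number): number {
  return inputTokens / outputLabelTokens;
}
```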
A key design choice: the eval length targets and the pipeline's own prompt targets are derived by the same outlineBudget() function. This prevents evaluator/pipeline drift — the eval never scores against targets the pipeline didn't know about.
The Iteration Loop
Every prompt change went through:
- Look at per-task reasoning logs — which tasks are misses?
- Patch the prompt, schema, or input pipeline
- Rerun the eval
- Check: did the specific misses improve? Did anything else regress?
This is what made every technique in the rest of this post trustworthy. Without it, I'd have no way to distinguish "this feels better on the two examples I checked" from "this actually improved across 26 diverse conversations."
Technique 1: Structured Input Compression
Here's a number: between pipeline v3.1 and v3.7, average input tokens per LLM call dropped from 31,704 to 11,554 — a 63% reduction.
The mechanism is a function called extractMarkdownStructure. Before the conversation hits the LLM, each long assistant message gets compressed:
- Headers are always preserved — they're the strongest structural signal for outlining
- List items are sampled — keep the first N items, add a truncation marker
- Code blocks and tables are replaced with brief markers
- Long paragraphs are truncated to a token budget (prefix + suffix)
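The steps above can be sketched in a few dozen lines. This is a simplified version: the real extractMarkdownStructure budgets in tokens and is tunable, while this sketch counts characters and hard-codes the markers:

```typescript
// Simplified sketch of the compression pass. The real extractMarkdownStructure
// budgets in tokens; this version uses character counts and fixed markers.
function compressMessage(markdown: string, keepListItems = 3, paraBudget = 200): string {
  const out: string[] = [];
  let listRun = 0;
  let inCode = false;
  let inTable = false;

  for (const line of markdown.split("\n")) {
    const trimmed = line.trimStart();

    // Code blocks are replaced with a single brief marker.
    if (trimmed.startsWith("```")) {
      if (!inCode) out.push("[code block omitted]");
      inCode = !inCode;
      continue;
    }
    if (inCode) continue;

    // Tables are replaced with a single brief marker.
    if (trimmed.startsWith("|")) {
      if (!inTable) out.push("[table omitted]");
      inTable = true;
      continue;
    }
    inTable = false;

    // Headers are always preserved: the strongest structural signal for outlining.
    if (/^#{1,6}\s/.test(trimmed)) {
      listRun = 0;
      out.push(line);
      continue;
    }

    // List items: keep the first N, then add a truncation marker.
    if (/^([-*+]|\d+\.)\s/.test(trimmed)) {
      listRun += 1;
      if (listRun <= keepListItems) out.push(line);
      else if (listRun === keepListItems + 1) out.push("- [more items omitted]");
      continue;
    }
    listRun = 0;

    // Long paragraphs: keep a prefix and a suffix within the budget.
    if (line.length > paraBudget) {
      const half = Math.floor(paraBudget / 2);
      out.push(line.slice(0, half) + " [...] " + line.slice(line.length - half));
    } else {
      out.push(line);
    }
  }
  return out.join("\n");
}
```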
The compression level is tunable. I ran ablations on:
- How many tokens to keep per paragraph
- Whether to keep 1 or 3 list items
- How aggressively to drop tables vs. keep them
63% was the sweet spot — I could go more aggressive, but quality starts to degrade beyond that.
This is a triple win:
- Cost: fewer input tokens = lower API cost per call
- Latency: fewer input tokens = faster time to first token
- Quality: less noise for the model's attention mechanism = slightly better output
The third point is consistent but not magic — it's just less distraction. A 50,000-token conversation with full code blocks, tables, and multi-paragraph explanations gives the model more opportunity to lose focus. Strip it down to structural signal, and the model concentrates on what matters for outlining. It doesn't need to read a 200-line code block to know that a section is about "implementing the authentication flow."
Crucially, the prompt knows about the compression. In v3.6, I added an explicit # Omissions section:
```
# Omissions
- You're provided the general structure of the conversation. Tables, code blocks,
  long paragraphs, long lists, and other non-textual content are omitted from the
  assistant's messages. You can assume that the assistant's messages are
  well-formatted. Headings are 100% provided. Use these as your clue.
- You should focus on the structure, logic, and flow of the conversation.
  Do not try to fill in omitted details.
```
Telling the model "headings are 100% provided, use these as your clue" is important. It prevents the model from spending capacity trying to infer what was cut, and directs attention to the most useful remaining signal.
Technique 2: Dynamic Budget Injection
Early pipeline versions had a static instruction: "target 3–8 topics." That's obviously wrong — a 5-message conversation and a 200-message conversation shouldn't target the same range.
I moved to dynamic budget computation. Given the conversation's assistant token volume (T) and turn count, the pipeline computes concrete targets before each call:
```typescript
// Topic count scales logarithmically — short chats get 2-3, long chats plateau ~10-12
const rawTop = 1 + 2.1 * Math.log2(T / 1500 + 1);
const targetTopTopics = clamp(Math.round(rawTop), 1, 12);

// Total nodes scale linearly — keeps information density roughly consistent
const rawTotalNodes = T / 350;
const targetTotalNodes = clamp(Math.round(rawTotalNodes), targetTopTopics * 2, 72);

const targetAvgSubtopics = clamp(Math.round(targetTotalNodes / targetTopTopics - 1), 2, 5);
```
These targets get injected directly into the prompt instructions:
```
# Length Target:
- You should generate 7 topics and 38 total subtopics.
- Each topic should on average have around 4 subtopics.
- Allocate more subtopics to high-signal topics and fewer to low-signal topics.
```
The logarithmic scaling for topics was tuned empirically — I tried linear, square root, and log, and log2 produced the most natural-feeling outlines across my eval dataset. The linear scaling for total nodes means a conversation with 2x more content gets roughly 2x more outline nodes, which feels intuitive.
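Rendering the computed targets into the instruction text is then plain string templating. A sketch, with paraphrased prompt wording and assumed field names:

```typescript
interface Budget {
  targetTopTopics: number;
  targetTotalNodes: number;
  targetAvgSubtopics: number;
}

// Sketch: render a computed budget into the prompt's Length Target section.
// The subtopic count (total nodes minus top-level topics) is an assumption.
function renderLengthTarget(b: Budget): string {
  const subtopics = b.targetTotalNodes - b.targetTopTopics;
  return [
    "# Length Target:",
    `- You should generate ${b.targetTopTopics} topics and ${subtopics} total subtopics.`,
    `- Each topic should on average have around ${b.targetAvgSubtopics} subtopics.`,
    "- Allocate more subtopics to high-signal topics and fewer to low-signal topics.",
  ].join("\n");
}
```

Because the same budget object feeds both the prompt and the eval targets, the two can't drift apart.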
One subtle addition in v3.5: I added index fields to the schema (index: 1, 2, 3... for both topics and subtopics) and told the model:
"You can use the topic index and subtopic index as counters to keep track of your progress toward the target."
This gives the model a counting mechanism during generation. Without it, models tend to lose track around topic 5-6 and either stop too early or overshoot. The index fields act as a built-in counter.
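A side benefit: the index fields make drift checkable after the fact. A hypothetical validator, not part of the extension:

```typescript
// Hypothetical post-hoc check: index fields should count 1, 2, 3, ...
// so both the model (during generation) and the pipeline (after parsing)
// can verify progress toward the target count.
function indexesAreSequential(nodes: { index: number }[]): boolean {
  return nodes.every((n, i) => n.index === i + 1);
}
```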
Technique 3: The Prompt Sandwich
When the conversation history is long — 10K, 20K, even 50K tokens — instructions placed only at the beginning lose influence. The model processes thousands of tokens of conversation content, and by the time it starts generating, the initial constraints have faded.
In practice, I saw:
- Topic counts drifting away from targets
- Output language switching (common in bilingual conversations)
- Anchoring discipline breaking down
The fix is structurally simple: repeat your critical dynamic constraints after the payload.
Prompt assembly became intentionally three-part:
```
┌─────────────────────────────────────────┐
│ Developer message (front)               │ ← Full task contract, all rules
├─────────────────────────────────────────┤
│ User message (middle)                   │ ← The conversation: 5K-50K tokens
│ <conversation>...</conversation>        │   of chat messages
├─────────────────────────────────────────┤
│ Developer message (back)                │ ← Compact final reminder
│ <final_reminder>                        │
│   - Output in 中文                       │
│   - 7 top-level topics                  │
│   - 38 subtopics total                  │
│ </final_reminder>                       │
└─────────────────────────────────────────┘
```
The trailing reminder contains only the dynamic constraints most likely to drift: language, topic count, subtopic count. The full rule set stays in the front. The back just re-anchors the three numbers the model is most likely to forget after processing a long payload.
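Assembling the sandwich is a few lines of code. A sketch: the role names follow the developer/user convention from the diagram, and the reminder wording is paraphrased:

```typescript
type Msg = { role: "developer" | "user"; content: string };

// Sketch of the three-part sandwich: full contract up front, long payload
// in the middle, compact reminder of drift-prone constraints at the back.
function buildSandwich(
  taskContract: string,
  conversation: string,
  reminder: { language: string; topics: number; subtopics: number }
): Msg[] {
  const finalReminder = [
    "<final_reminder>",
    `- Output in ${reminder.language}`,
    `- ${reminder.topics} top-level topics`,
    `- ${reminder.subtopics} subtopics total`,
    "</final_reminder>",
  ].join("\n");

  return [
    { role: "developer", content: taskContract },
    { role: "user", content: `<conversation>\n${conversation}\n</conversation>` },
    { role: "developer", content: finalReminder },
  ];
}
```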
I paired this with a second reinforcement layer: a reasoning field in the structured output schema.
```typescript
const TocSchema = z.object({
  reasoning: z.string().describe(
    "Think through the key requirements of the task: " +
    "length target for topics, length target for subtopics, " +
    "and output language requirement."
  ),
  toc: z.array(/* ... topic schema ... */)
})
```
The model must fill in reasoning before generating toc. This forces it to re-articulate the constraints one more time, right at the start of structured output generation.
Two layers of reinforcement:
- Trailing developer message restates constraints after the long payload
- Schema-required reasoning re-grounds constraints before generating output
Together, these measurably improved instruction following. This was especially important after I moved to single-call one-shot generation — there's no second aggregation pass to fix structural errors from the first call.
Technique 4: Anchoring Semantics — Think UX, Not Source
A small change that mattered more than expected. In v3.1, I changed the refId instruction from:
refId MUST be the earliest message id introducing the idea.
to:
refId MUST be the earliest message id where a reader can start consuming the content for this item. Prefer anchoring to the assistant's answer if the user asks a question and the answer immediately follows.
The difference: "where was this idea first mentioned" vs. "where should the user actually jump to."
If a user asks "how do I set up authentication?" at message #14 and the assistant answers at message #15, the refId should be 15. The answer is what they want to read, not their own question.
This is about encoding UX intent into prompt semantics. The model doesn't inherently know that your product is a navigation tool. By redefining "anchor" as "where to start reading" instead of "where the idea originates," you align model behavior with how users actually think about clicking a TOC entry.
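In the real pipeline the model applies this rule itself via the prompt, but the logic is crisp enough to sketch as a deterministic helper (hypothetical, for illustration):

```typescript
type Message = { id: number; role: "user" | "assistant" };

// Hypothetical helper: if an anchor lands on a user question and the
// assistant's answer immediately follows, prefer the answer. That is
// what the reader actually wants to jump to.
function preferAnswerAnchor(messages: Message[], refId: number): number {
  const i = messages.findIndex((m) => m.id === refId);
  if (i === -1) return refId;
  const next = messages[i + 1];
  if (messages[i].role === "user" && next?.role === "assistant") return next.id;
  return refId;
}
```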
The Results
Quality scores across comparable pipeline versions (all using the same LLM-as-judge prompt, all run on gpt-4.1-mini):
| Version | toc_quality | Key Changes |
|---|---|---|
| baseline | 0.831 | Pre-one-shot chunked pipeline (gpt-4.1-mini) |
| one-shot-v1 | 0.807 | Initial one-shot architecture (quality dipped) |
| v3.1 | 0.819 | Dynamic budgets, refId anchoring semantics |
| v3.2 | 0.808 | Input compression added (extractMarkdownStructure) |
| v3.4 | 0.808 | Compression parameter tuning |
| v3.6 | 0.826 | Omissions section, prompt sandwich, index counters |
| v3.7 | 0.863 | Schema reasoning reinforcement, prompt packing |
And the token efficiency trajectory from the earlier chart:
| Version | Avg Input Tokens | Change from v3.1 |
|---|---|---|
| v3.1 | 31,704 | — |
| v3.2 | 13,118 | -59% |
| v3.4 | 10,633 | -67% |
| v3.7 | 11,554 | -64% |
The story these two charts tell together: quality and efficiency improved simultaneously, but through different mechanisms at different times.
v3.2 was the token efficiency inflection — input compression cut tokens by 59% while quality held flat. Quality didn't degrade because the compression was structure-preserving, and the model wasn't getting useful signal from those code blocks and tables anyway.
v3.6/v3.7 was the quality inflection — the prompt sandwich, reasoning schema, and omissions section improved instruction following without meaningful token cost change.
Compared to the baseline, the final v3.7 pipeline scores 0.863 vs 0.831 — a meaningful improvement — while using 64% fewer input tokens than v3.1, where the one-shot line started. That's the kind of result that only comes from treating the entire pipeline as a system — input shaping, prompt structure, schema design, eval framework — not just rewriting the instruction text.
Model Selection
Before diving into prompt iteration, I ran a multi-model sweep under comparable settings:
| Model | toc_quality | n |
|---|---|---|
| gpt-4.1-mini | 0.831 | 26 |
| gpt-5.2 | 0.842 | 25 |
| x-ai/grok-4.1-fast | 0.825 | 23 |
| gpt-4.1 | 0.823 | 26 |
| gpt-4o-mini | 0.745 | 26 |
| gpt-4.1-nano | 0.562 | 26 |
The top tier (gpt-4.1-mini, gpt-5.2, grok, gpt-4.1) clustered within ~2 points of each other. Below that, quality fell off a cliff — gpt-4o-mini dropped 8 points and nano was essentially unusable for this task.
gpt-4.1-mini had the best quality/cost/reliability tradeoff. The takeaway: pick your model empirically with your eval, then spend effort on pipeline engineering instead of model shopping. The gap between gpt-4.1-mini and gpt-5.2 was smaller than the gap I later closed through prompt and pipeline improvements alone.
The Prompt Evolution At A Glance
For those curious about what actually changed in the prompt text across versions:
| Version | Prompt Change | Pipeline / Schema Change |
|---|---|---|
| v1 (chunked) | 3 separate prompts: chunk, aggregate, full | Two-stage chunk-then-merge pipeline |
| one-shot-v1 | Single prompt, static "target 3–8 topics" | One-shot architecture |
| v3 | Dynamic ${targetTopics}, ${targetSubtopics} injected | outlineBudget() function added |
| v3.1 | refId = "where reader starts consuming"; requirement extraction rules | — |
| v3.2–v3.4 | (unchanged) | extractMarkdownStructure, compression tuning |
| v3.5 | Added index fields for counting | Schema updated |
| v3.6 | Added # Omissions section; <final_reminder> block | Prompt sandwich (3-message structure) |
| v3.7 | (unchanged) | reasoning field in schema |
An interesting pattern: versions v3.2 through v3.4 had zero prompt text changes — all improvement came from pipeline-level input shaping. And the biggest quality jump (v3.6 to v3.7) came from structural techniques (prompt ordering, schema design) rather than rewriting the core instruction text.
This suggests that in many cases, the leverage isn't in finding better words. It's in shaping what surrounds the words.
Takeaways
1. Build your eval before you iterate on prompts. This isn't new advice, but the specific design matters. Task-based evaluation (not string similarity) with per-task reasoning gave me a debugging tool, not just a number. When something regressed, I knew which tasks broke and why. The eval turned every decision into a measurable experiment.
2. Shape model inputs, not just model instructions. Structured input compression produced the single largest efficiency gain in this project: 63% fewer tokens. It also improved quality by reducing noise. If you're working with long-context inputs, look critically at what you're sending to the model. There's probably structured content — tables, code, repetitive lists — you can compress without losing the signal the model needs for your task.
3. For long-context structured output, use a prompt sandwich. Repeat your critical dynamic constraints after the payload. Pair it with schema-level reasoning that forces the model to re-articulate those constraints before generating output. It's cheap, easy to implement, and measurably improves instruction following.
4. Make your structured output schema do work. The reasoning field and the index counters aren't decorative. They give the model scaffolding for self-monitoring during generation. If you're using structured output, think about what fields you can add that help the model stay on track — not just fields you need in the final output.
5. Encode UX intent into prompt semantics. When your prompt defines terms like "anchor" or "reference," think about what behavior you actually want from the end user's perspective. "Where the idea first appears" and "where the user should jump to" are different behaviors that require different prompt definitions.
6. Design eval metrics that match your product, not academic benchmarks. My eval weights task coverage at 45% and faithfulness at 2%. That weighting reflects what matters for a navigation tool. Your product has different priorities — make your eval reflect them.
The meta-lesson from this whole effort: prompt engineering for production structured output isn't about finding magic words. It's systems engineering. The prompt, the input pipeline, the output schema, the token budget, and the eval framework all have to mature together. Improving any one in isolation hits diminishing returns quickly. Improving them as a system is how you get both better quality and lower cost at the same time.