Prompt Engineering a 'Simple' Summarization Pipeline
TL;DR: I built a Chrome extension that generates a clickable table of contents from AI chat conversations. Summarization sounds trivial — it's the "hello world" of LLM applications. In practice, making it reliable across conversations ranging from 1K to 100K+ tokens required real prompt engineering: a task-based eval framework, structured input compression that cut input tokens by 63%, dynamic length budgets, and a prompt sandwich technique for long-context instruction following. This post covers the techniques that moved the needle and the eval-driven workflow that made each change trustworthy.
- The Task: Harder Than It Looks
- Eval First: Building the North Star
- Technique 1: Structured Input Compression
- Technique 2: Dynamic Budget Injection
- Technique 3: The Prompt Sandwich
- Technique 4: Anchoring Semantics — Think UX, Not Source
- The Results
- The Prompt Evolution At A Glance
- Takeaways
The Task: Harder Than It Looks
Chat Navigator is a Chrome extension I'm building. It reads your AI chat conversations — ChatGPT, Claude, Gemini, DeepSeek — and generates a clickable table of contents. An outline you can use to jump back to specific topics in long threads.
The core pipeline is simple: messages in, structured TOC out. A single LLM call.
But the quality bar is specific:
- Coverage: capture important conversation pivots, not just early topics
- Anchor accuracy: each TOC entry has a `refId` pointing to a specific message — clicking it should jump to the right place
- Label quality: topic labels need to be concise, scannable, and recognizable
- Length control: a 5-message chat shouldn't produce 12 topics, and a 200-message chat shouldn't produce 3
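To make the target concrete, here is a minimal sketch of the output shape. The field names are illustrative assumptions, not the extension's actual schema:

```typescript
// Illustrative TOC node shape. Field names are assumptions for this post,
// not the extension's real schema.
interface TocNode {
  label: string;       // concise, scannable topic label
  refId: number;       // message id this entry jumps to when clicked
  subtopics?: TocNode[];
}

// A tiny example outline for a short conversation.
const toc: TocNode[] = [
  {
    label: "Authentication Setup",
    refId: 3,
    subtopics: [{ label: "JWT Token Strategy", refId: 5 }],
  },
  { label: "Deployment Options", refId: 12 },
];
```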
"Summarize this conversation" is the hello world of LLM applications. "Generate a hierarchical outline with accurate jump anchors, dynamic length, and consistent quality across 1K–100K token inputs" is a different task entirely.
I spent about three days (Dec 30, 2025 – Jan 1, 2026) doing concentrated prompt engineering on this pipeline, running over 20 eval runs across 8 pipeline versions. This post distills what I learned into the techniques that actually mattered.
Eval First: Building the North Star
The biggest unlock wasn't a prompt technique. It was building evals before I started iterating on prompts.
This sounds obvious in retrospect, but I've seen many prompt engineering efforts (including my own earlier work) where the loop is: change the prompt, eyeball a few outputs, decide if it "feels better." That works for about two iterations before you lose track of whether you're actually improving.
Task-Based Evaluation, Not String Similarity
I didn't have gold-standard TOCs to compare against. And even if I did, string similarity against a reference outline would be a poor proxy — there are many valid ways to outline the same conversation.
Instead, I designed navigation tasks. Each conversation in my eval dataset (26 conversations covering a wide difficulty range) has a set of nav_tasks: things a user might want to find when revisiting the conversation. Each task has:
- A description (e.g., "find where they discussed the tradeoff between latency and accuracy")
- A `target_refId` — the message where this topic lives
- An `importance` weight (1, 2, or 3)
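A task record might look like this in code. This is a sketch of my understanding; the actual eval dataset format may differ:

```typescript
// Sketch of an eval-task record. Field names follow the post
// (target_refId, importance); the real dataset format may differ.
interface NavTask {
  id: string;
  description: string;    // what a user might want to find when revisiting
  target_refId: number;   // message where the topic lives
  importance: 1 | 2 | 3;  // weight in the final score
}

const tasks: NavTask[] = [
  {
    id: "T1",
    description: "find the latency vs. accuracy tradeoff discussion",
    target_refId: 23,
    importance: 3,
  },
  {
    id: "T2",
    description: "find the Redis caching decision",
    target_refId: 47,
    importance: 1,
  },
];
```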
The LLM judge evaluates: can a user find this task in the TOC? If they can, does the anchor jump to the right place?
This maps directly to what matters for the product. I don't care if the outline uses different words than some reference — I care whether a user can navigate with it.
The Judge Prompt
The judge scores along six dimensions, weighted to reflect product priorities:
- `task_coverage` (45%) dominates because it IS the product: can users find what they're looking for?
- `anchor_accuracy` (20%) is next because jumping to the wrong message is a broken experience.
- `label_quality` and `structure_quality` (15% each) matter for scannability.
- `conciseness` and `faithfulness` are small guardrails.
The judge uses a matching protocol with explicit tolerance bands for anchoring:
| Grade | Distance | Description |
|---|---|---|
| exact | refId == target | Perfect anchor |
| near | \|refId - target\| <= 2 | Close enough |
| acceptable | \|refId - target\| <= 5 | Usable but imprecise |
| far | \|refId - target\| > 5 | Treat as weak hit or miss |
Each task is scored as hit (1.0), weak_hit (0.5), or miss (0.0), weighted by importance.
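The aggregation can be sketched as a small function. The outcome values come from the post; the exact normalization is my assumption:

```typescript
type Outcome = "hit" | "weak_hit" | "miss";

// Outcome values from the post: hit = 1.0, weak_hit = 0.5, miss = 0.0.
const OUTCOME_SCORE: Record<Outcome, number> = { hit: 1.0, weak_hit: 0.5, miss: 0.0 };

// Importance-weighted task coverage: sum(score * importance) / sum(importance).
// The normalization is an assumption; the post only specifies the per-task scores.
function taskCoverage(results: { outcome: Outcome; importance: number }[]): number {
  const totalWeight = results.reduce((s, r) => s + r.importance, 0);
  if (totalWeight === 0) return 0;
  const earned = results.reduce(
    (s, r) => s + OUTCOME_SCORE[r.outcome] * r.importance,
    0
  );
  return earned / totalWeight;
}
```

With this scheme, missing a high-importance task hurts the score far more than missing a minor one, which is exactly the behavior a navigation product wants.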
One design choice I'm particularly happy with: the judge outputs structured task_results with per-task reasoning. Here's a condensed excerpt of the output schema:
```json
{
  "task_results": [
    {
      "task_id": "T1",
      "importance": 3,
      "outcome": "hit",
      "matched_path": "Authentication Setup > JWT Token Strategy",
      "anchor_grade": "near",
      "label_grade": "good",
      "notes": "Label is specific and refId lands 1 message from target"
    }
  ],
  "top_fixes": [
    "Add a subtopic for the Redis caching decision at message #47",
    "Split 'Backend Architecture' — it covers 3 unrelated topics"
  ]
}
```
When a score drops, I can look at which specific tasks went from hit to miss, and why. This turned vague "the score went down" into actionable "task T7 regressed because the model collapsed two topics into one generic bucket." The top_fixes field gives me concrete next steps.
Program-Based Guardrails
Alongside the LLM judge, I tracked three deterministic metrics:
- `length_top_topics`: how close the topic count is to the target
- `length_total_nodes`: how close total outline nodes are to the target
- `compression_ratio`: input tokens / output label tokens
These catch failure modes the judge might miss: "good coverage but the outline is way too long" or "decent labels but half the conversation is missing." They're cheap to compute and don't burn API credits.
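The post doesn't spell out the exact formulas, but a plausible sketch of these guardrails is simple. The linear-decay closeness metric here is my assumption:

```typescript
// Closeness of an actual count to a target, mapped to [0, 1].
// Linear decay in relative error is an assumption; the real metric may differ.
function lengthScore(actual: number, target: number): number {
  if (target <= 0) return 0;
  const relError = Math.abs(actual - target) / target;
  return Math.max(0, 1 - relError);
}

// compression_ratio: input tokens per output label token.
// Higher means the outline is a denser summary of the input.
function compressionRatio(inputTokens: number, outputLabelTokens: number): number {
  return inputTokens / outputLabelTokens;
}
```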
A key design choice: the eval length targets and the pipeline's own prompt targets are derived by the same outlineBudget() function. This prevents evaluator/pipeline drift — the eval never scores against targets the pipeline didn't know about.
The Iteration Loop
Every prompt change went through:
- Look at per-task reasoning logs — which tasks are misses?
- Patch the prompt, schema, or input pipeline
- Rerun the eval
- Check: did the specific misses improve? Did anything else regress?
This is what made every technique in the rest of this post trustworthy. Without it, I'd have no way to distinguish "this feels better on the two examples I checked" from "this actually improved across 26 diverse conversations."
Technique 1: Structured Input Compression
Here's a number: between pipeline v3.1 and v3.7, average input tokens per LLM call dropped from 31,704 to 11,554 — a 63% reduction.
The mechanism is a function called extractMarkdownStructure. Before the conversation hits the LLM, each long assistant message gets compressed:
- Headers are always preserved — they're the strongest structural signal for outlining
- List items are sampled — keep the first N items, add a truncation marker
- Code blocks and tables are replaced with brief markers
- Long paragraphs are truncated to a token budget (prefix + suffix)
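The steps above can be sketched in a few dozen lines. This is a simplified version: the real extractMarkdownStructure budgets in tokens and is tunable, while this sketch counts characters and hard-codes the markers:

```typescript
// Simplified sketch of the compression pass. The real extractMarkdownStructure
// budgets in tokens; this version uses character counts and fixed markers.
function compressMessage(markdown: string, keepListItems = 3, paraBudget = 200): string {
  const out: string[] = [];
  let listRun = 0;
  let inCode = false;
  let inTable = false;

  for (const line of markdown.split("\n")) {
    const trimmed = line.trimStart();

    // Code blocks are replaced with a single brief marker.
    if (trimmed.startsWith("```")) {
      if (!inCode) out.push("[code block omitted]");
      inCode = !inCode;
      continue;
    }
    if (inCode) continue;

    // Tables are replaced with a single brief marker.
    if (trimmed.startsWith("|")) {
      if (!inTable) out.push("[table omitted]");
      inTable = true;
      continue;
    }
    inTable = false;

    // Headers are always preserved: the strongest structural signal for outlining.
    if (/^#{1,6}\s/.test(trimmed)) {
      listRun = 0;
      out.push(line);
      continue;
    }

    // List items: keep the first N, then add a truncation marker.
    if (/^([-*+]|\d+\.)\s/.test(trimmed)) {
      listRun += 1;
      if (listRun <= keepListItems) out.push(line);
      else if (listRun === keepListItems + 1) out.push("- [more items omitted]");
      continue;
    }
    listRun = 0;

    // Long paragraphs: keep a prefix and a suffix within the budget.
    if (line.length > paraBudget) {
      const half = Math.floor(paraBudget / 2);
      out.push(line.slice(0, half) + " [...] " + line.slice(line.length - half));
    } else {
      out.push(line);
    }
  }
  return out.join("\n");
}
```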
The compression level is tunable. I ran ablations on:
- How many tokens to keep per paragraph
- Whether to keep 1 or 3 list items
- How aggressively to drop tables vs. keep them
63% was the sweet spot — I could go more aggressive, but quality starts to degrade beyond that.
This is a triple win:
- Cost: fewer input tokens = lower API cost per call
- Latency: fewer input tokens = faster time to first token
- Quality: less noise for the model's attention mechanism = slightly better output
The third point is consistent but not magic — it's just less distraction. A 50,000-token conversation with full code blocks, tables, and multi-paragraph explanations gives the model more opportunity to lose focus. Strip it down to structural signal, and the model concentrates on what matters for outlining. It doesn't need to read a 200-line code block to know that a section is about "implementing the authentication flow."
Crucially, the prompt knows about the compression. In v3.6, I added an explicit # Omissions section:
```
# Omissions
- You're provided the general structure of the conversation. Tables, code blocks,
  long paragraphs, long lists, and other non-textual content are omitted from the
  assistant's messages. You can assume that the assistant's messages are
  well-formatted. Headings are 100% provided. Use these as your clue.
- You should focus on the structure, logic, and flow of the conversation.
  Do not try to fill in omitted details.
```
Telling the model "headings are 100% provided, use these as your clue" is important. It prevents the model from spending capacity trying to infer what was cut, and directs attention to the most useful remaining signal.
Technique 2: Dynamic Budget Injection
Early pipeline versions had a static instruction: "target 3–8 topics." That's obviously wrong — a 5-message conversation and a 200-message conversation shouldn't target the same range.
I moved to dynamic budget computation. Given the conversation's assistant token volume (T) and turn count, the pipeline computes concrete targets before each call:
```typescript
// Topic count scales logarithmically — short chats get 2-3, long chats plateau ~10-12
const rawTop = 1 + 2.1 * Math.log2(T / 1500 + 1);
const targetTopTopics = clamp(Math.round(rawTop), 1, 12);

// Total nodes scale linearly — keeps information density roughly consistent
const rawTotalNodes = T / 350;
const targetTotalNodes = clamp(Math.round(rawTotalNodes), targetTopTopics * 2, 72);

const targetAvgSubtopics = clamp(Math.round(targetTotalNodes / targetTopTopics - 1), 2, 5);
```
These targets get injected directly into the prompt instructions:
```
# Length Target:
- You should generate 7 topics and 38 total subtopics.
- Each topic should on average have around 4 subtopics.
- Allocate more subtopics to high-signal topics and fewer to low-signal topics.
```
The logarithmic scaling for topics was tuned empirically — I tried linear, square root, and log, and log2 produced the most natural-feeling outlines across my eval dataset. The linear scaling for total nodes means a conversation with 2x more content gets roughly 2x more outline nodes, which feels intuitive.
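Rendering the computed targets into the instruction text is then plain string templating. A sketch, with paraphrased prompt wording and assumed field names:

```typescript
interface Budget {
  targetTopTopics: number;
  targetTotalNodes: number;
  targetAvgSubtopics: number;
}

// Sketch: render a computed budget into the prompt's Length Target section.
// The subtopic count (total nodes minus top-level topics) is an assumption.
function renderLengthTarget(b: Budget): string {
  const subtopics = b.targetTotalNodes - b.targetTopTopics;
  return [
    "# Length Target:",
    `- You should generate ${b.targetTopTopics} topics and ${subtopics} total subtopics.`,
    `- Each topic should on average have around ${b.targetAvgSubtopics} subtopics.`,
    "- Allocate more subtopics to high-signal topics and fewer to low-signal topics.",
  ].join("\n");
}
```

Because the same budget object feeds both the prompt and the eval targets, the two can't drift apart.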
One subtle addition in v3.5: I added index fields to the schema (index: 1, 2, 3... for both topics and subtopics) and told the model:
"You can use the topic index and subtopic index as counters to keep track of your progress toward the target."
This gives the model a counting mechanism during generation. Without it, models tend to lose track around topic 5-6 and either stop too early or overshoot. The index fields act as a built-in counter.
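A side benefit: the index fields make drift checkable after the fact. A hypothetical validator, not part of the extension:

```typescript
// Hypothetical post-hoc check: index fields should count 1, 2, 3, ...
// so both the model (during generation) and the pipeline (after parsing)
// can verify progress toward the target count.
function indexesAreSequential(nodes: { index: number }[]): boolean {
  return nodes.every((n, i) => n.index === i + 1);
}
```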
Technique 3: The Prompt Sandwich
When the conversation history is long — 10K, 20K, even 50K tokens — instructions placed only at the beginning lose influence. The model processes thousands of tokens of conversation content, and by the time it starts generating, the initial constraints have faded.
In practice, I saw:
- Topic counts drifting away from targets
- Output language switching (common in bilingual conversations)
- Anchoring discipline breaking down
The fix is structurally simple: repeat your critical dynamic constraints after the payload.
Prompt assembly became intentionally three-part:
```
┌─────────────────────────────────────────┐
│ Developer message (front)               │ ← Full task contract, all rules
├─────────────────────────────────────────┤
│ User message (middle)                   │ ← The conversation: 5K-50K tokens
│ <conversation>...</conversation>        │   of chat messages
├─────────────────────────────────────────┤
│ Developer message (back)                │ ← Compact final reminder
│ <final_reminder>                        │
│   - Output in 中文                       │
│   - 7 top-level topics                  │
│   - 38 subtopics total                  │
│ </final_reminder>                       │
└─────────────────────────────────────────┘
```
The trailing reminder contains only the dynamic constraints most likely to drift: language, topic count, subtopic count. The full rule set stays in the front. The back just re-anchors the three numbers the model is most likely to forget after processing a long payload.
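Assembling the sandwich is a few lines of code. A sketch: the role names follow the developer/user convention from the diagram, and the reminder wording is paraphrased:

```typescript
type Msg = { role: "developer" | "user"; content: string };

// Sketch of the three-part sandwich: full contract up front, long payload
// in the middle, compact reminder of drift-prone constraints at the back.
function buildSandwich(
  taskContract: string,
  conversation: string,
  reminder: { language: string; topics: number; subtopics: number }
): Msg[] {
  const finalReminder = [
    "<final_reminder>",
    `- Output in ${reminder.language}`,
    `- ${reminder.topics} top-level topics`,
    `- ${reminder.subtopics} subtopics total`,
    "</final_reminder>",
  ].join("\n");

  return [
    { role: "developer", content: taskContract },
    { role: "user", content: `<conversation>\n${conversation}\n</conversation>` },
    { role: "developer", content: finalReminder },
  ];
}
```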
I paired this with a second reinforcement layer: a reasoning field in the structured output schema.
```typescript
const TocSchema = z.object({
  reasoning: z.string().describe(
    "Think through the key requirements of the task: " +
    "length target for topics, length target for subtopics, " +
    "and output language requirement."
  ),
  toc: z.array(/* ... topic schema ... */)
})
```
The model must fill in reasoning before generating toc. This forces it to re-articulate the constraints one more time, right at the start of structured output generation.
Two layers of reinforcement:
- Trailing developer message restates constraints after the long payload
- Schema-required reasoning re-grounds constraints before generating output
Together, these measurably improved instruction following. This was especially important after I moved to single-call one-shot generation — there's no second aggregation pass to fix structural errors from the first call.
Technique 4: Anchoring Semantics — Think UX, Not Source
A small change that mattered more than expected. In v3.1, I changed the refId instruction from:
refId MUST be the earliest message id introducing the idea.
to:
refId MUST be the earliest message id where a reader can start consuming the content for this item. Prefer anchoring to the assistant's answer if the user asks a question and the answer immediately follows.
The difference: "where was this idea first mentioned" vs. "where should the user actually jump to."
If a user asks "how do I set up authentication?" at message #14 and the assistant answers at message #15, the refId should be 15. The answer is what they want to read, not their own question.
This is about encoding UX intent into prompt semantics. The model doesn't inherently know that your product is a navigation tool. By redefining "anchor" as "where to start reading" instead of "where the idea originates," you align model behavior with how users actually think about clicking a TOC entry.
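In the real pipeline the model applies this rule itself via the prompt, but the logic is crisp enough to sketch as a deterministic helper (hypothetical, for illustration):

```typescript
type Message = { id: number; role: "user" | "assistant" };

// Hypothetical helper: if an anchor lands on a user question and the
// assistant's answer immediately follows, prefer the answer. That is
// what the reader actually wants to jump to.
function preferAnswerAnchor(messages: Message[], refId: number): number {
  const i = messages.findIndex((m) => m.id === refId);
  if (i === -1) return refId;
  const next = messages[i + 1];
  if (messages[i].role === "user" && next?.role === "assistant") return next.id;
  return refId;
}
```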
The Results
Quality scores across comparable pipeline versions (all using the same LLM-as-judge prompt, all run on gpt-4.1-mini):
| Version | toc_quality | Key Changes |
|---|---|---|
| baseline | 0.831 | Pre-one-shot chunked pipeline (gpt-4.1-mini) |
| one-shot-v1 | 0.807 | Initial one-shot architecture (quality dipped) |
| v3.1 | 0.819 | Dynamic budgets, refId anchoring semantics |
| v3.2 | 0.808 | Input compression added (extractMarkdownStructure) |
| v3.4 | 0.808 | Compression parameter tuning |
| v3.6 | 0.826 | Omissions section, prompt sandwich, index counters |
| v3.7 | 0.863 | Schema reasoning reinforcement, prompt packing |
And the token efficiency trajectory from the earlier chart:
| Version | Avg Input Tokens | Change from v3.1 |
|---|---|---|
| v3.1 | 31,704 | — |
| v3.2 | 13,118 | -59% |
| v3.4 | 10,633 | -67% |
| v3.7 | 11,554 | -64% |
The story these two charts tell together: quality and efficiency improved simultaneously, but through different mechanisms at different times.
v3.2 was the token efficiency inflection — input compression cut tokens by 59% while quality held flat. Quality didn't degrade because the compression was structure-preserving, and the model wasn't getting useful signal from those code blocks and tables anyway.
v3.6/v3.7 was the quality inflection — the prompt sandwich, reasoning schema, and omissions section improved instruction following without meaningful token cost change.
Compared to the baseline, the final v3.7 pipeline scores 0.863 vs 0.831 — a meaningful improvement — while using 64% fewer input tokens than v3.1, where the one-shot line started. That's the kind of result that only comes from treating the entire pipeline as a system — input shaping, prompt structure, schema design, eval framework — not just rewriting the instruction text.
Model Selection
Before diving into prompt iteration, I ran a multi-model sweep under comparable settings:
| Model | toc_quality | n |
|---|---|---|
| gpt-4.1-mini | 0.831 | 26 |
| gpt-5.2 | 0.842 | 25 |
| x-ai/grok-4.1-fast | 0.825 | 23 |
| gpt-4.1 | 0.823 | 26 |
| gpt-4o-mini | 0.745 | 26 |
| gpt-4.1-nano | 0.562 | 26 |
The top tier (gpt-4.1-mini, gpt-5.2, grok, gpt-4.1) clustered within ~2 points of each other. Below that, quality fell off a cliff — gpt-4o-mini dropped 8 points and nano was essentially unusable for this task.
gpt-4.1-mini had the best quality/cost/reliability tradeoff. The takeaway: pick your model empirically with your eval, then spend effort on pipeline engineering instead of model shopping. The gap between gpt-4.1-mini and gpt-5.2 was smaller than the gap I later closed through prompt and pipeline improvements alone.
The Prompt Evolution At A Glance
For those curious about what actually changed in the prompt text across versions:
| Version | Prompt Change | Pipeline / Schema Change |
|---|---|---|
| v1 (chunked) | 3 separate prompts: chunk, aggregate, full | Two-stage chunk-then-merge pipeline |
| one-shot-v1 | Single prompt, static "target 3–8 topics" | One-shot architecture |
| v3 | Dynamic ${targetTopics}, ${targetSubtopics} injected | outlineBudget() function added |
| v3.1 | refId = "where reader starts consuming"; requirement extraction rules | — |
| v3.2–v3.4 | (unchanged) | extractMarkdownStructure, compression tuning |
| v3.5 | Added index fields for counting | Schema updated |
| v3.6 | Added # Omissions section; <final_reminder> block | Prompt sandwich (3-message structure) |
| v3.7 | (unchanged) | reasoning field in schema |
An interesting pattern: versions v3.2 through v3.4 had zero prompt text changes — all improvement came from pipeline-level input shaping. And the biggest quality jump (v3.6 to v3.7) came from structural techniques (prompt ordering, schema design) rather than rewriting the core instruction text.
This suggests that in many cases, the leverage isn't in finding better words. It's in shaping what surrounds the words.
Takeaways
1. Build your eval before you iterate on prompts. This isn't new advice, but the specific design matters. Task-based evaluation (not string similarity) with per-task reasoning gave me a debugging tool, not just a number. When something regressed, I knew which tasks broke and why. The eval turned every decision into a measurable experiment.
2. Shape model inputs, not just model instructions. Structured input compression produced the single largest efficiency gain in this project: 63% fewer tokens. It also improved quality by reducing noise. If you're working with long-context inputs, look critically at what you're sending to the model. There's probably structured content — tables, code, repetitive lists — you can compress without losing the signal the model needs for your task.
3. For long-context structured output, use a prompt sandwich. Repeat your critical dynamic constraints after the payload. Pair it with schema-level reasoning that forces the model to re-articulate those constraints before generating output. It's cheap, easy to implement, and measurably improves instruction following.
4. Make your structured output schema do work. The reasoning field and the index counters aren't decorative. They give the model scaffolding for self-monitoring during generation. If you're using structured output, think about what fields you can add that help the model stay on track — not just fields you need in the final output.
5. Encode UX intent into prompt semantics. When your prompt defines terms like "anchor" or "reference," think about what behavior you actually want from the end user's perspective. "Where the idea first appears" and "where the user should jump to" are different behaviors that require different prompt definitions.
6. Design eval metrics that match your product, not academic benchmarks. My eval weights task coverage at 45% and faithfulness at 2%. That weighting reflects what matters for a navigation tool. Your product has different priorities — make your eval reflect them.
The meta-lesson from this whole effort: prompt engineering for production structured output isn't about finding magic words. It's systems engineering. The prompt, the input pipeline, the output schema, the token budget, and the eval framework all have to mature together. Improving any one in isolation hits diminishing returns quickly. Improving them as a system is how you get both better quality and lower cost at the same time.