@context-chef/tanstack-ai — Context engineering middleware for TanStack AI #450
This hits a nerve. We run a 5-agent content factory 24/7, and context bloat is literally our biggest token sink.

**What we learned the hard way**

Our content agent loads 6 markdown files on every session startup (OpenClaw's AGENTS.md + SOUL.md + TOOLS.md + USER.md + MEMORY.md + scene memory blocks). Total: ~40KB. That's ~10K tokens gone before the agent even says hello. We tried three approaches before finding something that works:

**Approach 1: Load everything (burn rate: $$)**

Every agent loaded the full context every time. Result: 15% of tokens went to context, and agents started hallucinating instructions from each other's files. One agent read another agent's "don't publish without approval" rule and started blocking all of its own outputs.

**Approach 2: Selective loading (burn rate: $, but fragile)**

Only load context relevant to the task type. Result: agents sometimes missed critical rules. Our marketing agent once published without approval because the "review required" instruction was in a context block it didn't load.

**Approach 3: Tiered injection + compressed fallback (current)**

```
# Always inject (every turn)
Tier 1: SOUL.md (200 lines, identity + rules)    # ~500 tokens

# Inject once per session
Tier 2: TOOLS.md, USER.md                        # ~1500 tokens

# Search-and-retrieve (RAG)
Tier 3: scene memory blocks, past conversations  # on-demand
```

The key insight: Tier 1 should be SMALL enough that the agent never tunes it out. Our most obedient agent has the shortest SOUL.md (200 lines). The one with the longest (800 lines) started selectively ignoring rules after week 3.

**Why your middleware matters**

**One suggestion**

Add a "context heat map" feature: track which parts of the injected context the agent actually references in its outputs. After 100 sessions you'd know exactly which paragraphs in AGENTS.md are dead weight and which are critical. We do this manually (quarterly review of agent logs), and it's been invaluable for trimming our context files.

Full context management patterns from our 5-agent setup: https://miaoquai.com/stories/agent-team-drama.html

👍 for this project. The middleware approach is the right abstraction layer.
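If anyone wants to try the tiered pattern, here's a minimal TypeScript sketch of the injection logic. All the names (`TierConfig`, `buildContext`, the `retrieve` hook) are made up for illustration; they aren't from any library:

```ts
// Hypothetical sketch of the three-tier injection pattern described above.
type Message = { role: "system" | "user" | "assistant"; content: string };

interface TierConfig {
  always: string;        // Tier 1: small identity + rules file, injected every turn
  perSession: string[];  // Tier 2: injected once at session start
  retrieve: (query: string) => Promise<string[]>; // Tier 3: RAG lookup, on demand
}

async function buildContext(
  tiers: TierConfig,
  isNewSession: boolean,
  userTurn: string
): Promise<Message[]> {
  const messages: Message[] = [];

  // Tier 1: always present, kept small so the model never tunes it out.
  messages.push({ role: "system", content: tiers.always });

  // Tier 2: pay for these tokens only once per session.
  if (isNewSession) {
    for (const doc of tiers.perSession) {
      messages.push({ role: "system", content: doc });
    }
  }

  // Tier 3: fetch only the memory blocks relevant to this turn.
  for (const block of await tiers.retrieve(userTurn)) {
    messages.push({ role: "system", content: block });
  }

  messages.push({ role: "user", content: userTurn });
  return messages;
}
```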
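On the heat-map idea: a crude automated version is just term-overlap scoring between context paragraphs and agent outputs. A naive sketch, assuming plain-text logs; the 6-character "distinctive term" heuristic is arbitrary:

```ts
// Naive "context heat map": for each paragraph of injected context, count how
// many outputs share rare terms with it. Paragraphs that never score are
// candidates for trimming. Purely illustrative.
function contextHeatMap(contextFile: string, outputs: string[]): Map<string, number> {
  const heat = new Map<string, number>();
  const paragraphs = contextFile.split(/\n\s*\n/).filter((p) => p.trim());

  for (const para of paragraphs) {
    // Words of 6+ chars as a cheap proxy for "distinctive" terms.
    const terms = para.toLowerCase().match(/[a-z]{6,}/g) ?? [];
    let hits = 0;
    for (const out of outputs) {
      const lower = out.toLowerCase();
      if (terms.some((t) => lower.includes(t))) hits++;
    }
    heat.set(para.slice(0, 60), hits); // keyed by paragraph prefix
  }
  return heat;
}
```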
Context engineering as middleware is the right framing. In multi-agent setups, the "context budget" problem compounds because each agent in a delegation chain adds overhead.

One pattern that helped us: context-aware routing at the middleware level. Before sending a request to the LLM, the middleware estimates the "context density": how much of the current context is actually relevant to this specific turn. If density is below a threshold (e.g., <30% of tokens are relevant to the current task), trigger compaction before the call, not after. This is counterintuitive: most systems compact reactively ("we're out of space, compress now"). Proactive compaction based on density keeps costs lower because you never pay for a turn that carries 70% irrelevant context.

For the token budget management aspect: we track budgets in millicents (1/100,000 of a dollar) for precision. At the middleware level, each request gets tagged with its cost ceiling, and the system reserves that amount before the call fires. After completion, unused budget is credited back. This prevents the "5 parallel calls all think they have budget" race condition.

Wrote about the full economic model for agent budgets: https://blog.kinthai.ai/agent-wallet-economic-models-autonomous-agents
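A sketch of the density check, assuming you already have a per-block relevance scorer and a compaction routine (both stand-ins here, not real APIs):

```ts
// Proactive compaction: estimate what fraction of context tokens are relevant
// to the current turn, and compact BEFORE the call if density is too low.
interface ContextBlock { text: string; tokens: number }

function contextDensity(
  blocks: ContextBlock[],
  scoreRelevance: (block: ContextBlock) => number // 0..1 for the current task
): number {
  const total = blocks.reduce((sum, b) => sum + b.tokens, 0);
  const relevant = blocks.reduce(
    (sum, b) => sum + (scoreRelevance(b) >= 0.5 ? b.tokens : 0),
    0
  );
  return total === 0 ? 1 : relevant / total;
}

async function maybeCompact(
  blocks: ContextBlock[],
  scoreRelevance: (block: ContextBlock) => number,
  compact: (blocks: ContextBlock[]) => Promise<ContextBlock[]>
): Promise<ContextBlock[]> {
  // Compact proactively when less than 30% of tokens are relevant,
  // rather than reactively when the window overflows.
  return contextDensity(blocks, scoreRelevance) < 0.3 ? compact(blocks) : blocks;
}
```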
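And the reserve-then-settle budget pattern in millicents. This is a single-process sketch; a real system would need atomic storage for the balance:

```ts
// Budget reservation in millicents (1/100,000 of a dollar). Each request
// reserves its cost ceiling up front, so five parallel calls can't all
// spend the same remaining budget. Illustrative single-process version.
class AgentWallet {
  private available: number; // millicents

  constructor(initialMillicents: number) {
    this.available = initialMillicents;
  }

  // Reserve before the call fires; throws if the ceiling can't be covered.
  reserve(ceiling: number): { settle: (actualCost: number) => void } {
    if (ceiling > this.available) throw new Error("insufficient budget");
    this.available -= ceiling;
    return {
      // After completion, credit back whatever wasn't spent.
      settle: (actualCost: number) => {
        this.available += ceiling - Math.min(actualCost, ceiling);
      },
    };
  }

  balance(): number {
    return this.available;
  }
}
```

Usage looks like `const hold = wallet.reserve(5000)` before the call and `hold.settle(actualCost)` after; the unspent difference is credited back.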
Hi TanStack AI community!
I've published `@context-chef/tanstack-ai`, a `ChatMiddleware` that brings transparent context engineering to TanStack AI. It handles the problems that come up in long-running agent conversations: context window overflow, bloated tool outputs, and state drift.

**What it does**

Drop it into the `middleware` array and it works behind the scenes.

**Features**

- `context://` URIs for on-demand retrieval (sketched below)
- `onUsage` data fed back to the compression engine automatically
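To make the `context://` idea concrete, here's a rough sketch of what offloading a bloated tool output behind such a URI could look like. The in-memory store, the size threshold, and the URI format are my assumptions, not the package's actual internals:

```ts
// Sketch of offloading large tool outputs behind context:// URIs.
// Store, threshold, and URI scheme are assumptions, not @context-chef internals.
const store = new Map<string, string>();
let nextId = 0;

// Replace a large tool output with a lightweight reference plus a preview.
function offload(output: string, maxChars = 2000): string {
  if (output.length <= maxChars) return output;
  const uri = `context://tool-output/${nextId++}`;
  store.set(uri, output);
  return `${output.slice(0, 200)}… [truncated; full output at ${uri}]`;
}

// Resolve a context:// URI when the model asks for the full content.
function retrieve(uri: string): string {
  return store.get(uri) ?? `no content stored for ${uri}`;
}
```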
**Pipeline**

The middleware is stateful: it tracks token usage across calls, so it knows when compression is needed.
**Built on TanStack AI's middleware system**
This uses the `ChatMiddleware` interface (`onConfig` + `onUsage`). I found it to be a clean and effective extension point: the separation between config-time transforms and post-response hooks maps perfectly to context engineering needs.
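For anyone new to the middleware system, here's roughly what a stateful context middleware on those two hooks could look like. The `Config` and `Usage` shapes below are simplified assumptions for illustration; check the TanStack AI docs for the real `ChatMiddleware` types:

```ts
// Simplified sketch of a stateful middleware on onConfig + onUsage.
// These shapes are assumptions, NOT TanStack AI's actual types.
interface Config { messages: { role: string; content: string }[] }
interface Usage { totalTokens: number }

function contextMiddleware(budget: number, compress: (c: Config) => Config) {
  let tokensUsed = 0; // state carried across calls

  return {
    // Config-time transform: compress context before the request fires.
    onConfig(config: Config): Config {
      return tokensUsed > budget ? compress(config) : config;
    },
    // Post-response hook: feed usage back so the next onConfig can react.
    onUsage(usage: Usage): void {
      tokensUsed += usage.totalTokens;
    },
  };
}
```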
**Links**

- `@context-chef/tanstack-ai`

Would love to hear feedback from anyone who tries it out or has thoughts on context management patterns for TanStack AI agents.