Optimization
Call fluiq.optimize() after instrument() to enable trace-driven Redis caching. Fluiq's backend analyses your historical traces, identifies which LLM calls repeat most, and provisions a dedicated Redis instance for your account. On the first call the SDK fetches that profile and begins serving repeated prompts from cache — saving both latency and LLM spend with no extra code.
Team plan and above
fluiq.optimize() requires a Team, Growth, or Enterprise plan. Calling it on a Free account logs a warning and skips caching — tracing continues normally, your application is never interrupted.
Setup
import fluiq
fluiq.instrument(api_key="fl_...")
fluiq.optimize()
# All LLM calls from this point are transparently intercepted.
# Repeated (model, messages) pairs are served from Redis instantly —
# no LLM API call is made and your spend drops accordingly.How it works
- On the first LLM call after startup the SDK fetches your optimization profile from the Fluiq backend.
- The profile contains which models to cache, the suggested TTL, and the connection URL for your dedicated Redis instance.
- Subsequent calls with an identical
(model, messages)combination are served from Redis instantly — your LLM provider is never contacted. - Real responses are cached automatically — there is nothing extra to instrument.
- The dashboard Optimization tab shows cache hit rate and estimated spend saved alongside your traces.
Modes
"cache"default
Full Redis caching enabled. Repeated calls matching the backend profile are served from Redis before the LLM API is called. Real responses are stored automatically.
"observe"optional
No interception. The SDK records what would have been a cache hit so you can review potential savings — latency and spend — before opting into full caching.
fluiq.optimize(mode="observe") # review savings first
fluiq.optimize(mode="cache") # then enable full cachingFail-open by design
If the profile endpoint is unreachable, returns an error, or Redis is unavailable, every LLM call proceeds normally to your provider. The cache layer never blocks your application.
MCP tool caching
When MCP servers are in use, fluiq.optimize() transparently caches two expensive operations on every ClientSession:
list_tools()— response cached in Redis keyed by server URL. Automatically invalidated whensession.initialize()is called (server restart).call_tool(name, arguments)— result cached keyed by(server_url, tool_name, sorted_arguments). Error results are never cached.
Hit and miss counts appear in the Optimize dashboard under mcp_list_tools and mcp_call in the "By cache type" breakdown.
# No extra code required — MCP caching is transparent once optimize() is called.
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client
async with streamablehttp_client("https://your-mcp-server/mcp") as (r, w, _):
async with ClientSession(r, w) as session:
await session.initialize()
tools = await session.list_tools() # cached after first call
result = await session.call_tool("search", {"query": "fluiq"}) # cachedProvider prompt caching
In addition to Fluiq's own Redis layer, fluiq.optimize() unlocks each provider's built-in prefix caching and surfaces the saved token counts in every trace.
Anthropiccache_control injected automatically
Every messages.create() and messages.stream() call receives cache_control: {"type": "ephemeral"} on the system prompt and last tool definition. Anthropic silently ignores it on blocks below ~1,024 tokens, so injection is always safe. Cached token counts (prompt_cache_read_tokens, prompt_cache_creation_tokens) appear in every trace.
OpenAIautomatic for prompts ≥ 1,024 tokens
No configuration required. OpenAI caches eligible prompts automatically. Fluiq captures usage.prompt_tokens_details.cached_tokens from every response as prompt_cached_tokens.
Geminiexplicit CachedContent (user-managed)
Create a CachedContent object via the Gemini API and pass it to generate_content. Fluiq captures usage_metadata.cached_content_token_count from every response as prompt_cached_tokens.
All three providers feed the Prompt Caching card on the Optimize dashboard, which shows total cached tokens read, Anthropic cache-write overhead, and hit rate across instrumented calls.