Optimization

Call fluiq.optimize() after instrument() to enable trace-driven Redis caching. Fluiq's backend analyses your historical traces, identifies which LLM calls repeat most, and provisions a dedicated Redis instance for your account. On the first LLM call after startup, the SDK fetches the resulting optimization profile and begins serving repeated prompts from cache, which reduces latency and LLM spend with no extra code.

Team plan and above

fluiq.optimize() requires a Team, Growth, or Enterprise plan. Calling it on a Free account logs a warning and skips caching; tracing continues normally, and your application is never interrupted.

Setup

Python

import fluiq

fluiq.instrument(api_key="fl_...")
fluiq.optimize()

# All LLM calls from this point are transparently intercepted.
# Repeated (model, messages) pairs are served from Redis instantly;
# no LLM API call is made and your spend drops accordingly.

How it works

On the first LLM call after startup, the SDK fetches your optimization profile from the Fluiq backend.
The profile contains which models to cache, the suggested TTL, and the connection URL for your dedicated Redis instance.
Subsequent calls with an identical (model, messages) combination are served from Redis instantly; your LLM provider is never contacted.
Responses returned by your provider on a cache miss are stored automatically; there is nothing extra to instrument.
The dashboard Optimization tab shows cache hit rate and estimated spend saved alongside your traces.
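
The lookup flow described above can be illustrated with a small sketch. This is not Fluiq's implementation: the CachedClient class, the cache_key helper, and the SHA-256 key format are hypothetical, and a plain dictionary stands in for the dedicated Redis instance. TTL handling and the per-model profile are left out to keep the example short.

```python
import hashlib
import json


def cache_key(model, messages):
    """Derive a deterministic key from a (model, messages) pair (hypothetical format)."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedClient:
    """Serve repeated (model, messages) pairs from a cache instead of the provider."""

    def __init__(self, call_llm):
        self._call_llm = call_llm  # function that actually contacts the provider
        self._store = {}           # stand-in for Redis in this sketch
        self.hits = 0
        self.misses = 0

    def complete(self, model, messages):
        key = cache_key(model, messages)
        if key in self._store:
            # Identical request seen before: the provider is not contacted.
            self.hits += 1
            return self._store[key]
        # Cache miss: call the provider and store the real response.
        self.misses += 1
        response = self._call_llm(model, messages)
        self._store[key] = response
        return response
```

Calling complete() twice with the same model and messages would contact the provider once; the second call is answered from the store and counted as a hit.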

Modes

"cache" (default)

Full Redis caching enabled. Repeated calls matching the backend profile are served from Redis without calling the LLM API. Real responses are stored automatically.

"observe" (optional)

No interception. The SDK records what would have been a cache hit so you can review potential savings (latency and spend) before opting into full caching.

Python

fluiq.optimize(mode="observe")   # review savings first
fluiq.optimize(mode="cache")     # then enable full caching
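
To make the difference between the two modes concrete, here is a rough sketch of what "observe" amounts to: every call still reaches the provider, and the wrapper only counts how many calls would have been cache hits. The ObserveRecorder class and its counters are hypothetical names for illustration and do not describe the SDK's internals.

```python
import json


class ObserveRecorder:
    """Count would-be cache hits without ever intercepting a call."""

    def __init__(self, call_llm):
        self._call_llm = call_llm  # function that actually contacts the provider
        self._seen = set()         # requests observed so far
        self.total_calls = 0
        self.would_be_hits = 0

    def complete(self, model, messages):
        key = (model, json.dumps(messages, sort_keys=True))
        self.total_calls += 1
        if key in self._seen:
            # In "cache" mode this call could have been served from the cache.
            self.would_be_hits += 1
        else:
            self._seen.add(key)
        # No interception: the provider is always called in this mode.
        return self._call_llm(model, messages)

    def potential_hit_rate(self):
        if self.total_calls == 0:
            return 0.0
        return self.would_be_hits / self.total_calls
```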

Fail-open by design

If the profile endpoint is unreachable, returns an error, or Redis is unavailable, every LLM call proceeds normally to your provider. The cache layer never blocks your application.
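
The fail-open behaviour can be pictured as follows. This is a minimal sketch under assumed names (fail_open_lookup, cache_get, call_llm are hypothetical): any error raised by the cache is treated as a miss, so the call always falls through to the provider.

```python
def fail_open_lookup(cache_get, call_llm, key, *args):
    """Try the cache first; on any cache failure, proceed to the provider."""
    try:
        cached = cache_get(key)
    except Exception:
        # Cache unreachable or erroring: treat it as a miss, never as a failure.
        cached = None
    if cached is not None:
        return cached
    # Miss or cache failure: the call proceeds normally to the provider.
    return call_llm(*args)
```

With this shape, only an error from the provider itself can surface to the caller; cache errors are absorbed.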

MCP tool caching

When MCP servers are in use, fluiq.optimize() transparently caches two expensive operations on every MCP ClientSession:

list_tools(): response cached in Redis keyed by server URL. Automatically invalidated when session.initialize() is called (server restart).
call_tool(name, arguments): result cached keyed by (server_url, tool_name, sorted_arguments). Error results are never cached.

Hit and miss counts appear in the Optimize dashboard under mcp_list_tools and mcp_call in the "By cache type" breakdown.
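
As an illustration of keying on (server_url, tool_name, sorted_arguments), the sketch below builds a key that does not depend on the order in which arguments were supplied. The call_tool_key function, the mcp_call: prefix, and the hashing scheme are assumptions made for this example, not the SDK's actual key format.

```python
import hashlib
import json


def call_tool_key(server_url, tool_name, arguments):
    """Build a cache key from (server_url, tool_name, sorted_arguments)."""
    # Sorting keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} serialize identically.
    canonical_args = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    raw = "\n".join([server_url, tool_name, canonical_args])
    return "mcp_call:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Two calls to the same tool on the same server with the same arguments then map to one cache entry, while a different tool name or server URL produces a different key.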

Python

# No extra code required; MCP caching is transparent once optimize() is called.
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async with streamablehttp_client("https://your-mcp-server/mcp") as (r, w, _):
    async with ClientSession(r, w) as session:
        await session.initialize()
        tools = await session.list_tools()   # cached after first call
        result = await session.call_tool("search", {"query": "fluiq"})  # cached

Provider prompt caching

In addition to Fluiq's own Redis layer, fluiq.optimize() works with each provider's built-in prompt caching and surfaces the cached token counts in every trace.

Anthropic: cache_control injected automatically

Every messages.create() and messages.stream() call receives cache_control: {"type": "ephemeral"} on the system prompt and last tool definition. Anthropic silently ignores it on blocks below ~1,024 tokens, so injection is always safe. Cached token counts (prompt_cache_read_tokens, prompt_cache_creation_tokens) appear in every trace.
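
For readers who want to see what the injected request looks like, the sketch below applies the same marker to a plain dictionary of request arguments. The with_cache_control helper is hypothetical and is not the SDK's code; when the system prompt is already a list of blocks, marking the last block is an assumption of this example.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}


def with_cache_control(request):
    """Return a copy of the request with cache_control on the system prompt and last tool."""
    patched = copy.deepcopy(request)  # leave the caller's request untouched

    system = patched.get("system")
    if isinstance(system, str):
        # A plain string becomes a single text block so it can carry the marker.
        patched["system"] = [
            {"type": "text", "text": system, "cache_control": dict(EPHEMERAL)}
        ]
    elif isinstance(system, list) and system:
        # Assumption: when blocks are already given, mark the last one.
        system[-1]["cache_control"] = dict(EPHEMERAL)

    tools = patched.get("tools")
    if tools:
        tools[-1]["cache_control"] = dict(EPHEMERAL)

    return patched
```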

OpenAI: automatic for prompts ≥ 1,024 tokens

No configuration required. OpenAI caches eligible prompts automatically. Fluiq captures usage.prompt_tokens_details.cached_tokens from every response as prompt_cached_tokens.
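
The field Fluiq reads can also be inspected by hand. The sketch below assumes the usage section of a response has been converted to a plain dictionary; the helper names cached_tokens_from_usage and cached_fraction are hypothetical.

```python
def cached_tokens_from_usage(usage):
    """Read usage.prompt_tokens_details.cached_tokens, defaulting to 0 when absent."""
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


def cached_fraction(usage):
    """Share of prompt tokens that were served from the provider's prompt cache."""
    prompt_tokens = int(usage.get("prompt_tokens") or 0)
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens_from_usage(usage) / prompt_tokens
```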

Gemini: explicit CachedContent (user-managed)

Create a CachedContent object via the Gemini API and pass it to generate_content. Fluiq captures the cached content token count from every response as prompt_cached_tokens.

All three providers feed the Prompt Caching card on the Optimize dashboard, which shows total cached tokens read, Anthropic cache-write overhead, and hit rate across instrumented calls.
