Evaluation

Evaluation is opt-in: instrument() only traces, and scoring starts once you call fluiq.eval(). From that point, Fluiq runs an LLM-as-judge on every traced LLM response, scores each metric from 0 to 1, and stores the results in your dashboard. Use block mode to gate on quality in CI. For whole-run scoring (tool selection, trajectory, and multi-agent coordination), run an agentic evaluation on a root trace or over a Dataset from the dashboard.

Call fluiq.eval() once after instrument(); every subsequent traced LLM call is scored in the background. Results appear in the Evaluations dashboard; a warning is logged when a score falls below its threshold.

Python

import openai
import fluiq

fluiq.instrument(api_key="fl_...")
fluiq.eval(
    metrics=["hallucination", "relevance"],  # scored on every LLM call
    mode="warn",                             # default; never blocks
    thresholds={"hallucination": 0.8, "relevance": 0.7},
)

client = openai.OpenAI()

# Evaluation fires automatically after this call
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What year did World War II end?"}],
)
print(response.choices[0].message.content)
# Scores visible in the Fluiq Evaluations tab
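
The threshold rule described above (a warning is logged when a score falls below its threshold) can be pictured with a small standalone sketch. This is not Fluiq's implementation and calls no Fluiq API: the function name metrics_below_threshold and the example scores are hypothetical, and skipping a metric that has no score is an assumption made here for illustration.

```python
from typing import Dict, List


def metrics_below_threshold(
    scores: Dict[str, float], thresholds: Dict[str, float]
) -> List[str]:
    """Return the metrics whose score is strictly below its threshold.

    Hypothetical helper for illustration only; not part of the fluiq package.
    """
    flagged = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        # Assumption: a metric with no recorded score is skipped, not flagged.
        if score is not None and score < minimum:
            flagged.append(metric)
    return flagged


# Same thresholds as the fluiq.eval() call above.
thresholds = {"hallucination": 0.8, "relevance": 0.7}

# Made-up scores on the 0-to-1 scale, standing in for one judged response.
example_scores = {"hallucination": 0.92, "relevance": 0.55}

flagged = metrics_below_threshold(example_scores, thresholds)
print(flagged)  # ['relevance']
```

Under this reading, a score exactly equal to its threshold is not flagged, since only scores that fall below the threshold trigger a warning.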
