Evaluation
Add one line to score every LLM response with Fluiq's server-side judge. Set per-metric thresholds and choose whether a failure logs a warning or blocks the response from reaching your application.
import fluiq

fluiq.instrument(api_key="fl_...")
fluiq.eval(
    thresholds={
        "hallucination": 0.8,   # score 0–1; 1 = no hallucination
        "faithfulness": 0.7,    # grounded in provided context
        "relevance": 0.75,      # response addresses the question
        "toxicity": 0.9,        # 1 = completely safe
    },
    mode="warn",                # "warn" | "block"
    judge_model="gpt-4o-mini",  # judge model Fluiq uses server-side
)

Supported metrics
| Metric | What it measures | Score 1.0 means |
|---|---|---|
| hallucination | Factual claims not supported by the prompt/context | No hallucination — every claim is grounded |
| faithfulness | Whether the response stays within the provided context | Fully grounded — no outside claims added |
| relevance | How directly the response addresses the question | Completely on-topic and direct |
| toxicity | Harmful, offensive, or hateful content in the response | Completely safe and respectful |
| coherence | Logical structure and internal consistency | Perfectly coherent and well-structured |
| completeness | Whether the response fully answers the question | Comprehensive — no key information omitted |
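
To make the threshold semantics concrete, here is a minimal sketch of how scores could be compared against per-metric thresholds. It is an illustration only, not Fluiq's implementation: `find_failures` is a hypothetical helper, and it assumes that a score equal to its threshold passes and that metrics without a configured threshold are not checked.

```python
def find_failures(scores, thresholds):
    """Return {metric: score} for every metric that scored below its threshold.

    Hypothetical helper for illustration; assumes scores in [0, 1] where
    higher is better, as in the table above.
    """
    return {
        metric: scores[metric]
        for metric, minimum in thresholds.items()
        if metric in scores and scores[metric] < minimum
    }


thresholds = {
    "hallucination": 0.8,
    "faithfulness": 0.7,
    "relevance": 0.75,
    "toxicity": 0.9,
}
scores = {
    "hallucination": 0.42,
    "faithfulness": 0.90,
    "relevance": 0.61,
    "toxicity": 0.98,
    "coherence": 0.30,  # no threshold configured, so it is not checked here
}

failures = find_failures(scores, thresholds)
print(failures)  # {'hallucination': 0.42, 'relevance': 0.61}
```

Because every metric is oriented so that 1.0 is the best outcome, a single "lower than threshold fails" rule covers all six metrics, including toxicity.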
Modes
mode="warn" (default)
Evaluation runs in a background thread after the LLM responds. Your application receives the response immediately. A Python warning is logged for every metric that falls below its threshold — visible in your logs and in the Fluiq dashboard's Quality column.
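
The non-blocking pattern described above can be sketched in plain Python. This is an assumption-laden illustration, not Fluiq's internals: `evaluate_in_background`, the `judge` callable, and the `eval_sketch` logger name are all hypothetical, and the sketch uses the standard `logging` module to emit the warning.

```python
import logging
import threading

logger = logging.getLogger("eval_sketch")  # hypothetical logger name


def evaluate_in_background(response_text, judge, thresholds):
    """Score a response on a worker thread and log one warning per failing metric.

    `judge` is a hypothetical callable returning {metric: score}.
    The started thread is returned so callers (or tests) can join it.
    """
    def worker():
        scores = judge(response_text)
        for metric, minimum in thresholds.items():
            score = scores.get(metric)
            if score is not None and score < minimum:
                logger.warning(
                    "%s scored %.2f, below threshold %.2f", metric, score, minimum
                )

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread  # control returns to the caller immediately


def fake_judge(text):
    # Stand-in for a real judge call; returns fixed scores for the demo.
    return {"relevance": 0.5, "toxicity": 0.95}


t = evaluate_in_background(
    "example response", fake_judge, {"relevance": 0.75, "toxicity": 0.9}
)
t.join()  # only needed in this demo, to wait for the warning before exiting
```

The trade-off is that the application has already used the response by the time a warning appears, which is why this mode suits monitoring rather than enforcement.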
mode="block"
Evaluation runs synchronously before the response is returned. If any metric is below its threshold, a FluiqEvalError is raised instead, so the low-quality response never reaches your application. Use this mode in staging or for safety-critical flows.
from fluiq.exceptions import FluiqEvalError

try:
    response = client.chat.completions.create(...)
except FluiqEvalError as e:
    print(e.failures)  # {"hallucination": 0.42, "relevance": 0.61}
    print(e.scores)    # all metric scores
    # fallback logic here

GitHub Actions eval gate
Gate every PR on quality scores stored during your test suite. The workflow below runs your tests (which generate traces evaluated by Fluiq), waits briefly for async evals to land, then queries the Fluiq API and fails the build if any score is below the threshold.
# .github/workflows/fluiq-eval-gate.yml
name: Fluiq Eval Gate

on:
  pull_request:
    branches: [main]

jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        # httpx is listed explicitly because the check script below imports it
        run: pip install -r requirements.txt fluiq httpx
      - name: Run test suite
        env:
          FLUIQ_API_KEY: ${{ secrets.FLUIQ_API_KEY }}
        run: pytest tests/ -x
      - name: Wait for evaluations
        run: sleep 30
      - name: Check evaluation scores
        env:
          FLUIQ_API_KEY: ${{ secrets.FLUIQ_API_KEY }}
          THRESHOLD: ${{ vars.FLUIQ_EVAL_THRESHOLD || '0.7' }}
        run: |
          python - <<'PYEOF'
          import httpx, os, sys

          api_key = os.environ["FLUIQ_API_KEY"]
          threshold = float(os.environ.get("THRESHOLD", "0.7"))

          # Fetch evaluations recorded in the last 10 minutes
          resp = httpx.get(
              "https://api.getfluiq.com/api/v1/optimize/evals",
              headers={"x-api-key": api_key},
              params={"window_minutes": 10, "threshold": threshold},
              timeout=15,
          )
          resp.raise_for_status()
          data = resp.json()

          if data["total"] == 0:
              print("No evaluations found; skipping gate.")
              sys.exit(0)

          # Compare against None so that an average of 0.0 is still printed
          avg = data.get("avg_score")
          avg_text = f"{avg:.2f}" if avg is not None else "n/a"
          print(f"Evals: {data['total']} total, {data['passed']} passed, {data['failed']} failed (avg {avg_text})")

          if data["failed"] > 0:
              for e in data["entries"]:
                  if e["score"] is not None and e["score"] < threshold:
                      print(f"  FAIL {e['metric']}: {e['score']:.2f} trace={e['trace_id']}")
              sys.exit(1)

          print(f"All scores above threshold ({threshold}).")
          PYEOF

Quotas
Each LLM response evaluation consumes one count from your tier's eval budget. Traces continue to ingest normally once the cap is hit — only the auto-eval is skipped.
| Tier | Traces | Evaluations / month |
|---|---|---|
| Free | 50K / mo | 1,000 |
| Team | Unlimited | 10,000 |
| Growth | Unlimited | 100,000 |
| Enterprise | Unlimited | Unlimited |
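
As a rough planning aid, the arithmetic behind the table can be sketched as follows. The budgets are copied from the table above; `eval_coverage` is a hypothetical helper that estimates what share of a month's responses fits inside the eval budget, assuming one evaluation per response as stated above.

```python
import math

# Monthly evaluation budgets, taken from the quota table
MONTHLY_EVALS = {
    "Free": 1_000,
    "Team": 10_000,
    "Growth": 100_000,
    "Enterprise": math.inf,  # "Unlimited"
}


def eval_coverage(tier, responses_per_month):
    """Fraction of responses (0.0 to 1.0) that can be evaluated before the cap is hit."""
    if responses_per_month <= 0:
        return 1.0  # nothing to evaluate, so nothing is skipped
    return min(1.0, MONTHLY_EVALS[tier] / responses_per_month)


print(eval_coverage("Free", 50_000))    # 0.02
print(eval_coverage("Team", 8_000))     # 1.0
print(eval_coverage("Growth", 250_000)) # 0.4
```

For example, a Free-tier project that produces 50,000 responses in a month would have roughly the first 2% of them evaluated; the remaining traces would still be ingested, only without scores.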