Evaluation

Add one line to score every LLM response with Fluiq's server-side judge. Set per-metric thresholds and choose whether a failing score logs a warning or blocks the response from reaching your application.

Python
import fluiq

fluiq.instrument(api_key="fl_...")
fluiq.eval(
    thresholds={
        "hallucination": 0.8,   # score 0–1; 1 = no hallucination
        "faithfulness":  0.7,   # grounded in provided context
        "relevance":     0.75,  # response addresses the question
        "toxicity":      0.9,   # 1 = completely safe
    },
    mode="warn",                # "warn" | "block"
    judge_model="gpt-4o-mini",  # judge model Fluiq uses server-side
)

Supported metrics

Metric        | What it measures                                       | Score 1.0 means
------------- | ------------------------------------------------------ | ------------------------------------------
hallucination | Factual claims not supported by the prompt/context     | No hallucination: every claim is grounded
faithfulness  | Whether the response stays within the provided context | Fully grounded: no outside claims added
relevance     | How directly the response addresses the question       | Completely on-topic and direct
toxicity      | Harmful, offensive, or hateful content in the response | Completely safe and respectful
coherence     | Logical structure and internal consistency             | Perfectly coherent and well-structured
completeness  | Whether the response fully answers the question        | Comprehensive: no key information omitted
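
To make the threshold rule concrete, here is a minimal sketch of the comparison in plain Python. The helper `find_failures` is hypothetical and not part of the Fluiq SDK. It assumes only what this page states: scores run from 0 to 1, higher is better, and a metric fails when its score falls below its threshold.

```python
def find_failures(scores: dict[str, float], thresholds: dict[str, float]) -> dict[str, float]:
    """Return the metrics whose score is strictly below the configured threshold."""
    return {
        metric: scores[metric]
        for metric, minimum in thresholds.items()
        # Metrics without a score are ignored; a score equal to the threshold passes.
        if metric in scores and scores[metric] < minimum
    }


thresholds = {"hallucination": 0.8, "faithfulness": 0.7, "relevance": 0.75, "toxicity": 0.9}
scores = {"hallucination": 0.42, "faithfulness": 0.91, "relevance": 0.61, "toxicity": 0.99}

failures = find_failures(scores, thresholds)
print(failures)  # {'hallucination': 0.42, 'relevance': 0.61}
```

Metrics that have no threshold configured are never reported as failures in this sketch.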

Modes

mode="warn" (default)

Evaluation runs in a background thread after the LLM responds. Your application receives the response immediately. A Python warning is logged for every metric that falls below its threshold — visible in your logs and in the Fluiq dashboard's Quality column.
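
The pattern described above (return the response immediately, score it on a worker thread, log a warning per failing metric) can be sketched with the standard library. This illustrates the general technique under stated assumptions and is not Fluiq's implementation: `evaluate_in_background`, `fake_judge`, and the logger name are hypothetical, and the sketch uses `logging` because this page does not say which mechanism the SDK uses to emit warnings.

```python
import logging
import threading

logger = logging.getLogger("eval_sketch")  # hypothetical logger name


def evaluate_in_background(response_text, judge, thresholds):
    """Start scoring on a worker thread and return the thread without waiting."""

    def worker():
        scores = judge(response_text)  # judge returns {metric: score}
        for metric, minimum in thresholds.items():
            score = scores.get(metric)
            if score is not None and score < minimum:
                logger.warning("%s scored %.2f, below threshold %.2f", metric, score, minimum)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def fake_judge(text):
    """Stand-in for the server-side judge; returns fixed scores."""
    return {"relevance": 0.61, "toxicity": 0.99}


thread = evaluate_in_background("some response", fake_judge, {"relevance": 0.75, "toxicity": 0.9})
thread.join()  # joined here only so the example finishes deterministically
```

In a real application the caller would not join the thread. That is what keeps evaluation off the request path.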

mode="block"

Evaluation runs synchronously before the response is returned, so each call waits for the judge. If any metric is below its threshold, a FluiqEvalError is raised instead of returning the response, so the low-quality response never reaches your application. Use this mode in staging or for safety-critical flows.

Python
from fluiq.exceptions import FluiqEvalError

try:
    response = client.chat.completions.create(...)
except FluiqEvalError as e:
    print(e.failures)   # {"hallucination": 0.42, "relevance": 0.61}
    print(e.scores)     # all metric scores
    # fallback logic here
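
One possible shape for the fallback logic is to retry a limited number of times and then return a canned answer. The sketch below is self-contained, so it defines a local stand-in for `FluiqEvalError`. Its constructor signature is an assumption; only the `failures` and `scores` attributes come from the example above. `call_with_fallback` and `flaky_generate` are hypothetical names.

```python
class FluiqEvalError(Exception):
    """Local stand-in so the sketch runs without the fluiq package installed."""

    def __init__(self, failures, scores):
        super().__init__(f"metrics below threshold: {sorted(failures)}")
        self.failures = failures
        self.scores = scores


def call_with_fallback(generate, fallback_text, max_attempts=2):
    """Call generate() up to max_attempts times; return fallback_text if every attempt is blocked."""
    for attempt in range(1, max_attempts + 1):
        try:
            return generate()
        except FluiqEvalError as e:
            print(f"attempt {attempt} blocked: {e.failures}")
    return fallback_text


attempts = []


def flaky_generate():
    """Simulated LLM call: blocked on the first attempt, accepted on the second."""
    attempts.append(1)
    if len(attempts) == 1:
        raise FluiqEvalError({"hallucination": 0.42}, {"hallucination": 0.42, "relevance": 0.9})
    return "second answer"


result = call_with_fallback(flaky_generate, "Sorry, I can't answer that right now.")
print(result)  # second answer
```

Each retry in block mode pays for another LLM call and another evaluation, so a small `max_attempts` is usually the safer default.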

GitHub Actions eval gate

Gate every PR on quality scores stored during your test suite. The workflow below runs your tests (which generate traces evaluated by Fluiq), waits briefly for async evals to land, then queries the Fluiq API and fails the build if any score is below the threshold.

YAML
# .github/workflows/fluiq-eval-gate.yml
name: Fluiq Eval Gate

on:
  pull_request:
    branches: [main]

jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install -r requirements.txt fluiq

      - name: Run test suite
        env:
          FLUIQ_API_KEY: ${{ secrets.FLUIQ_API_KEY }}
        run: pytest tests/ -x

      - name: Wait for evaluations
        run: sleep 30

      - name: Check evaluation scores
        env:
          FLUIQ_API_KEY: ${{ secrets.FLUIQ_API_KEY }}
          THRESHOLD: ${{ vars.FLUIQ_EVAL_THRESHOLD || '0.7' }}
        run: |
          python - <<'PYEOF'
          import httpx, os, sys
          api_key   = os.environ["FLUIQ_API_KEY"]
          threshold = float(os.environ.get("THRESHOLD", "0.7"))
          resp = httpx.get(
              "https://api.getfluiq.com/api/v1/optimize/evals",
              headers={"x-api-key": api_key},
              params={"window_minutes": 10, "threshold": threshold},
              timeout=15,
          )
          resp.raise_for_status()
          data = resp.json()
          if data["total"] == 0:
              print("No evaluations found — skipping gate.")
              sys.exit(0)
          avg = data.get("avg_score")
          avg_text = f"{avg:.2f}" if avg is not None else "n/a"  # 0.0 is a valid average, not "n/a"
          print(f"Evals: {data['total']} total, {data['passed']} passed, {data['failed']} failed  (avg {avg_text})")
          if data["failed"] > 0:
              for e in data["entries"]:
                  if e["score"] is not None and e["score"] < threshold:
                      print(f"  FAIL  {e['metric']}: {e['score']:.2f}  trace={e['trace_id']}")
              sys.exit(1)
          print(f"All scores above threshold ({threshold}).")
          PYEOF
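
The pass/fail decision in the inline script can also be written as a pure function, which makes it easy to unit-test without calling the API. This is a sketch under assumptions: `evaluate_gate` is a hypothetical helper, the payload fields (`total`, `passed`, `failed`, `avg_score`, `entries`) are the ones the workflow reads above, and the sample values are invented for illustration.

```python
def evaluate_gate(data: dict, threshold: float) -> tuple[int, list[str]]:
    """Return (exit_code, report_lines) for an evals summary payload."""
    if data["total"] == 0:
        return 0, ["No evaluations found, skipping gate."]

    avg = data.get("avg_score")
    avg_text = f"{avg:.2f}" if avg is not None else "n/a"
    lines = [
        f"Evals: {data['total']} total, {data['passed']} passed, "
        f"{data['failed']} failed (avg {avg_text})"
    ]

    if data["failed"] > 0:
        for entry in data.get("entries", []):
            score = entry["score"]
            if score is not None and score < threshold:
                lines.append(f"  FAIL  {entry['metric']}: {score:.2f}  trace={entry['trace_id']}")
        return 1, lines

    lines.append(f"All scores above threshold ({threshold}).")
    return 0, lines


# Invented sample payload; real responses may carry additional fields.
sample_payload = {
    "total": 3,
    "passed": 2,
    "failed": 1,
    "avg_score": 0.78,
    "entries": [
        {"metric": "relevance", "score": 0.61, "trace_id": "trace-123"},
        {"metric": "toxicity", "score": 0.99, "trace_id": "trace-123"},
        {"metric": "faithfulness", "score": None, "trace_id": "trace-456"},
    ],
}

exit_code, report = evaluate_gate(sample_payload, threshold=0.7)
print("\n".join(report))
```

A script in the workflow could then call `sys.exit(exit_code)` after printing the report.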

Quotas

Each LLM response evaluation consumes one count from your tier's eval budget. Traces continue to ingest normally once the cap is hit — only the auto-eval is skipped.

Tier       | Traces    | Evaluations / month
---------- | --------- | -------------------
Free       | 50K / mo  | 1,000
Team       | Unlimited | 10,000
Growth     | Unlimited | 100,000
Enterprise | Unlimited | Unlimited
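
Because each evaluated response consumes one count, a quick estimate shows which tier covers a given monthly volume and what share of responses would be scored on a smaller one. The helpers below are hypothetical. The budgets are copied from the table above, and the sketch considers only evaluation budgets, not trace limits.

```python
# Monthly evaluation budgets from the table above; None means unlimited.
EVAL_BUDGETS = {"Free": 1_000, "Team": 10_000, "Growth": 100_000, "Enterprise": None}


def smallest_tier_for(evals_per_month: int) -> str:
    """Return the first tier (smallest first) whose budget covers the volume."""
    for tier, budget in EVAL_BUDGETS.items():
        if budget is None or evals_per_month <= budget:
            return tier
    raise ValueError("no tier covers this volume")  # unreachable while Enterprise is unlimited


def evaluated_fraction(responses_per_month: int, tier: str) -> float:
    """Share of responses that get scored before the tier's cap is reached."""
    budget = EVAL_BUDGETS[tier]
    if budget is None or responses_per_month == 0:
        return 1.0
    return min(1.0, budget / responses_per_month)


print(smallest_tier_for(25_000))           # Growth
print(evaluated_fraction(40_000, "Team"))  # 0.25
```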