Evaluation

Add one line to score every LLM response with Fluiq's server-side judge. Set per-metric thresholds and choose whether a failing score logs a warning or blocks the response from reaching your application.

Python

import fluiq

fluiq.instrument(api_key="fl_...")
fluiq.eval(
    thresholds={
        "hallucination": 0.8,   # score 0-1; 1 = no hallucination
        "faithfulness":  0.7,   # grounded in provided context
        "relevance":     0.75,  # response addresses the question
        "toxicity":      0.9,   # 1 = completely safe
    },
    mode="warn",                # "warn" | "block"
    judge_model="gpt-4o-mini",  # judge model Fluiq uses server-side
)

Evaluation is opt-in

instrument() only traces; it never evaluates on its own. Scoring runs only once you call fluiq.eval(…) (or trigger an evaluation from the dashboard). From that point every LLM response in the process is scored against your thresholds; remove the call and Fluiq goes back to tracing only. In the default warn mode, evaluation runs asynchronously and adds no latency to your app; block mode evaluates synchronously before the response is returned (see Modes).

Supported metrics

Metric	What it measures	Score 1.0 means
hallucination	Factual claims not supported by the prompt/context	No hallucination; every claim is grounded
faithfulness	Whether the response stays within the provided context	Fully grounded; no outside claims added
relevance	How directly the response addresses the question	Completely on-topic and direct
toxicity	Harmful, offensive, or hateful content in the response	Completely safe and respectful
coherence	Logical structure and internal consistency	Perfectly coherent and well-structured
completeness	Whether the response fully answers the question	Comprehensive; no key information omitted
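
The pass/fail rule behind the thresholds can be sketched in plain Python. This is an illustration, not Fluiq's implementation: find_failures is a hypothetical helper, and it assumes that a metric fails only when its score is strictly below its threshold and that metrics without a threshold are not checked.

```python
def find_failures(scores: dict[str, float], thresholds: dict[str, float]) -> dict[str, float]:
    """Return the metrics whose score is below its threshold (hypothetical helper)."""
    return {
        metric: score
        for metric, score in scores.items()
        # Assumption: metrics without a configured threshold are not checked.
        if metric in thresholds and score < thresholds[metric]
    }


thresholds = {"hallucination": 0.8, "faithfulness": 0.7, "relevance": 0.75, "toxicity": 0.9}
scores = {"hallucination": 0.42, "faithfulness": 0.9, "relevance": 0.61, "toxicity": 0.97}

failures = find_failures(scores, thresholds)
print(failures)  # {'hallucination': 0.42, 'relevance': 0.61}
```

Because every metric is oriented so that 1.0 is the best score, a single "below threshold" comparison covers all of them, including toxicity.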

Custom judges

Built-in metrics not enough? Write your own LLM-as-judge. In the dashboard go to Prompts, write your judge prompt, and Save it with type Judge. The template uses {{question}}, {{answer}} and {{context}} placeholders and should ask the model to return a JSON object with a numeric score (0 to 1) and a reason. Then reference it by its slug in custom_judges (slug → threshold). Each judge is scored on every response just like a built-in metric and appears in the dashboard under its slug.

Python

# Saved in Prompts (type: Judge), slug "refund-policy":
#
#   You are auditing a support reply for refund-policy compliance.
#   QUESTION: {{question}}
#   ANSWER:   {{answer}}
#   POLICY:   {{context}}
#   Return JSON: {"score": <0-1>, "reason": "<why>"}

fluiq.eval(
    metrics=["hallucination"],          # built-ins still run
    thresholds={"hallucination": 0.8},
    custom_judges={
        "refund-policy":  0.9,          # slug → pass threshold
        "brand-tone":     0.7,
    },
    mode="warn",
)

In block mode a custom judge scoring below its threshold raises FluiqEvalError just like a built-in metric. If a slug doesn't resolve to a saved Judge prompt it is silently skipped; your call is never broken by a missing judge.
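
To make the template contract concrete, here is a minimal local sketch of filling the three placeholders and parsing the JSON reply. Fluiq does this server-side; render_judge_prompt and parse_judge_reply are hypothetical helpers written for illustration, and the range check on the score is an assumption about what a well-formed reply looks like.

```python
import json
import re

JUDGE_TEMPLATE = (
    "You are auditing a support reply for refund-policy compliance.\n"
    "QUESTION: {{question}}\n"
    "ANSWER:   {{answer}}\n"
    "POLICY:   {{context}}\n"
    'Return JSON: {"score": <0-1>, "reason": "<why>"}'
)

PLACEHOLDER = re.compile(r"\{\{(question|answer|context)\}\}")


def render_judge_prompt(template: str, question: str, answer: str, context: str) -> str:
    """Fill the placeholders in one pass so substituted text is never re-scanned."""
    values = {"question": question, "answer": answer, "context": context}
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


def parse_judge_reply(reply: str) -> tuple[float, str]:
    """Parse the judge's JSON reply and check that the score lies in 0 to 1."""
    data = json.loads(reply)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score, str(data.get("reason", ""))


prompt = render_judge_prompt(
    JUDGE_TEMPLATE,
    question="Can I get a refund after 45 days?",
    answer="Yes, refunds are available at any time.",
    context="Refunds are available within 30 days of purchase.",
)
score, reason = parse_judge_reply('{"score": 0.2, "reason": "Contradicts the 30-day policy"}')
print(score, reason)  # 0.2 Contradicts the 30-day policy
```

Testing a judge prompt this way before saving it can help you confirm that the model reliably returns parseable JSON with a score in range.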

Transparent, editable judge prompts

No black-box scoring: every score records the exact judge prompt (and its version) that produced it. Expand Judge prompts under any evaluation in the trace drawer to read it. If a grading rubric doesn't match how you want a metric judged, edit it at Dashboard → Judge Prompts. Your edit applies only to your organization and takes effect within about a minute, required placeholders are validated so a save can't break scoring, and you can reset to the platform prompt or restore any earlier version. Because scores carry the prompt version, you can tell exactly when a rubric change happened in your score history.
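
The placeholder validation mentioned above might look roughly like the following check, which you could also run locally before pasting an edited rubric into the dashboard. This is a sketch under assumptions: the document does not say which placeholders are required, so the set below (all three named in Custom judges) and the helper missing_placeholders are hypothetical.

```python
import re

# Assumption: all three placeholders named in this document are required.
REQUIRED_PLACEHOLDERS = {"question", "answer", "context"}


def missing_placeholders(template: str) -> set[str]:
    """Return the required placeholders that the template does not contain."""
    found = set(re.findall(r"\{\{(\w+)\}\}", template))
    return REQUIRED_PLACEHOLDERS - found


draft = "Rate the reply.\nQUESTION: {{question}}\nANSWER: {{answer}}"
print(missing_placeholders(draft))  # {'context'}
```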

User feedback & team annotations

Judges aren't the only signal. Record your end users' reactions with fluiq.feedback() right after the LLM call the user is reacting to — the verdict lands next to the automated scores on that trace. Your team can also add a thumbs-up/down with a note on any trace from the drawer's Evaluation tab. Human signals are shown alongside judge scores but never move the automated quality rollups.

Python

import fluiq

fluiq.instrument(api_key="fl_...")

# `client` is your own LLM client; the OpenAI-style call below is illustrative
answer = client.chat.completions.create(...)   # traced call
show_to_user(answer)

# later, when the user reacts:
fluiq.feedback(True, name="thumbs")                     # 👍 on the last LLM call
fluiq.feedback(0.25, name="csat", comment="Too slow",   # or a 0-1 rating
               trace_id=saved_trace_id)                 # target a specific trace (an ID you stored earlier)

Modes

mode="warn" (default)

Evaluation runs in a background thread after the LLM responds. Your application receives the response immediately. A warning is logged for every metric that falls below its threshold, visible in your logs and in the Fluiq dashboard's Quality column.
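
The warn-mode flow can be pictured with a small stand-alone sketch: the caller keeps the response, and scoring plus logging happen on a separate thread. Nothing here is Fluiq's code. fake_judge stands in for the server-side judge with fixed scores, and score_in_background is a hypothetical helper.

```python
import logging
import threading

logger = logging.getLogger("eval_sketch")


def fake_judge(response: str) -> dict[str, float]:
    """Stand-in for the server-side judge; returns fixed scores for illustration."""
    return {"hallucination": 0.42, "relevance": 0.9}


def score_in_background(response: str, thresholds: dict[str, float]) -> threading.Thread:
    """Start scoring on a daemon thread and return immediately."""

    def worker() -> None:
        for metric, score in fake_judge(response).items():
            threshold = thresholds.get(metric)
            if threshold is not None and score < threshold:
                logger.warning("%s scored %.2f, below threshold %.2f", metric, score, threshold)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


response = "The refund window is 90 days."
thread = score_in_background(response, {"hallucination": 0.8, "relevance": 0.75})
# The caller already holds `response` at this point; join() is only for the demo.
thread.join()
```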

mode="block"

Evaluation runs synchronously before the response is returned, so each call waits for the judge. If any metric is below its threshold, a FluiqEvalError is raised instead and the low-quality response never reaches your application. Use this mode in staging or for safety-critical flows.

Python

from fluiq.exceptions import FluiqEvalError

try:
    response = client.chat.completions.create(...)
except FluiqEvalError as e:
    print(e.failures)   # {"hallucination": 0.42, "relevance": 0.61}
    print(e.scores)     # all metric scores
    # fallback logic here
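
One possible shape for the fallback logic is a small retry wrapper. The sketch below is self-contained so it runs without the SDK: the FluiqEvalError class is a local stand-in that only mirrors the failures and scores attributes shown above (its constructor is an assumption), and call_with_fallback is a hypothetical helper.

```python
class FluiqEvalError(Exception):
    """Local stand-in for fluiq.exceptions.FluiqEvalError (constructor is assumed)."""

    def __init__(self, failures: dict[str, float], scores: dict[str, float]):
        super().__init__(f"evaluation failed: {failures}")
        self.failures = failures
        self.scores = scores


def call_with_fallback(call, fallback_text: str, max_attempts: int = 2):
    """Retry a blocked call, then return a safe fallback (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except FluiqEvalError as err:
            print(f"attempt {attempt} blocked: {err.failures}")
    return fallback_text


def always_blocked():
    """Simulates an LLM call whose response fails evaluation every time."""
    raise FluiqEvalError({"hallucination": 0.42}, {"hallucination": 0.42, "relevance": 0.9})


result = call_with_fallback(always_blocked, "Sorry, I can't answer that right now.")
print(result)  # Sorry, I can't answer that right now.
```

In real code you would import the exception from fluiq.exceptions and pass a lambda wrapping your LLM call. Keep max_attempts small, since each blocked attempt has already waited for a synchronous evaluation.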

GitHub Actions eval gate

Gate every PR on a real eval run. python -m fluiq.ci launches a batch evaluation over one of your datasets (grading each example against its expected output, or running a full agentic evaluation), waits for the report, prints per-metric averages, and exits non-zero when the average score is below your gate, failing the build with an annotated error. It works in any repo: the gate runs against your Fluiq dataset, so it does not depend on your project's language or test suite.

YAML

# .github/workflows/fluiq-eval-gate.yml
name: Fluiq Eval Gate

on:
  pull_request:
    branches: [main]

jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install Fluiq
        run: pip install fluiq

      - name: Run eval gate
        env:
          FLUIQ_API_KEY: ${{ secrets.FLUIQ_API_KEY }}
        run: |
          python -m fluiq.ci \
            --dataset "checkout-regressions" \
            --kind metrics \
            --metrics hallucination,relevance,completeness \
            --fail-below 0.7 \
            --min-example 0.5

--kind agentic runs the full layered agent evaluation (tool selection, trajectory, coordination) over each example's pinned trajectory instead. --min-example additionally fails the build when any single example falls below that floor, so one bad regression can't hide behind a good average. Exit codes: 0 pass, 1 gate failed, 2 error/timeout.

Quotas

Tracing is always free and unlimited; what paid tiers change is trace retention (Free keeps 14 days, paid tiers keep traces forever). Evaluation is metered separately: each scored LLM response consumes one evaluation from your tier's monthly budget. When the budget is exhausted, traces keep ingesting normally; only new scoring is paused until the next cycle.

Tier	Traces	Retention	Evaluations / month
Free	Unlimited	14 days	100
Starter	Unlimited	Forever	2,000
Team	Unlimited	Forever	10,000
Growth	Unlimited	Forever	50,000
Enterprise	Unlimited	Forever	Unlimited

Starter, Team, and Growth include a no-card 5-day trial so you can try unlimited retention and higher eval budgets before upgrading.
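
To estimate how far a budget goes, the metering rule can be modeled locally. The sketch below only restates the table and the "one scored response, one evaluation" rule; the EvalBudget class is hypothetical and is not part of the SDK.

```python
# Monthly evaluation budgets from the table above; None means unlimited.
EVAL_BUDGETS = {
    "Free": 100,
    "Starter": 2_000,
    "Team": 10_000,
    "Growth": 50_000,
    "Enterprise": None,
}


class EvalBudget:
    """Tracks evaluations used in one billing cycle (hypothetical model)."""

    def __init__(self, tier: str):
        self.limit = EVAL_BUDGETS[tier]
        self.used = 0

    def try_score(self) -> bool:
        """Consume one evaluation if budget remains; return False when scoring is paused."""
        if self.limit is not None and self.used >= self.limit:
            return False  # the trace is still ingested, only scoring is skipped
        self.used += 1
        return True


budget = EvalBudget("Free")
scored = sum(budget.try_score() for _ in range(150))
print(scored)  # 100
```

With 150 responses on the Free tier, the first 100 are scored and the remaining 50 are traced without scores.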
