Running Evaluations

An Evaluation Run is a single execution of an evaluation. Each run sends every query in the query set to the search endpoint, collects the results, and prepares them for judgment.

Creating an Evaluation Run

In the UI

1. Navigate to Evaluations and select an evaluation.
2. Click Create Run.
3. Enter a Name for the run.
4. Select the Scale for judging relevance; see Scales for details.
5. Select the Metrics you want calculated; see Metrics for details.
6. Click Create.

Using the API

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/runs" \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
  "name": "Run 1 - Baseline",
  "scale": "graded",
  "metrics": ["NDCG@10", "MAP", "ERR@10", "MRR@10"]
}'

Scale Options

Three scales are available, each defining the granularity of relevance grades. Pick the scale that matches how nuanced your judgments need to be. See Scales for the full reference.

Scale	Range	Use when
`binary`		Pass/fail relevance is enough: was this result relevant or not?
`graded`		Most evaluation work; distinguishes marginal / fair / highly / perfectly relevant.
`detailed`		Fine-grained ranking work where small differences in relevance matter.

Metrics

Metrics are specified by name, optionally with a cutoff depth using @k. For example, NDCG@10 computes NDCG over the top 10 results.

See Metrics for formulas and worked examples.
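To make the `@k` convention concrete, here is a minimal Python sketch that splits a metric name into its base and cutoff and computes NDCG at that cutoff. It assumes the common exponential-gain form (gain `2^grade - 1`, `log2` discount) and builds the ideal ranking from the returned grades only; the formula Releval actually uses is described under Metrics and may differ. The function names are hypothetical.

```python
import math

def parse_metric(name):
    """Split 'NDCG@10' into ('NDCG', 10); a bare name like 'MAP' has no cutoff."""
    base, sep, depth = name.partition("@")
    return base, (int(depth) if sep else None)

def dcg(grades, k=None):
    """Discounted cumulative gain over the top-k grades (assumed 2^g - 1 gain)."""
    top = grades if k is None else grades[:k]
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(top, start=1))

def ndcg(grades, k=None):
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades for one query's results, in ranked order.
grades = [3, 2, 0, 1]
metric, cutoff = parse_metric("NDCG@10")
score = ndcg(grades, cutoff)
```

With a cutoff of 10 and only four results, the whole list is scored; a smaller cutoff such as `NDCG@1` would look at the first result alone.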

Overriding the template for a single run

By default, a new run inherits the query template attached to the evaluation. To compare ranking variants without forking the template, supply overrides on the create-run request: any of the request body, query string, content type, and headers can be replaced for that run only. A typical use is keeping the same endpoint and query set, but tweaking one knob between runs:

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/runs" \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
  "name": "Re-rank weight 0.8",
  "scale": "graded",
  "metrics": ["NDCG@10", "MAP"],
  "body": "{ \"query\": { \"function_score\": { \"query\": { \"match\": { \"title\": \"{{query}}\" } }, \"weight\": 0.8 } } }"
}'

Overrides are preserved on the run itself, so the comparison between runs can flag exactly which parts of the configuration changed.
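Hand-escaping the `body` override is error-prone because it is a JSON document carried as a string inside another JSON document. A small Python sketch of one way to build the same request body is shown below; `build_run_payload` is a hypothetical helper, and the field names simply mirror the curl example above.

```python
import json

def build_run_payload(name, scale, metrics, body_override=None):
    """Assemble a create-run request body; 'body' is sent as a JSON-encoded string."""
    payload = {"name": name, "scale": scale, "metrics": metrics}
    if body_override is not None:
        # Serialise the search request so the inner quotes are escaped for us.
        payload["body"] = json.dumps(body_override)
    return payload

# The search request from the curl example, with the {{query}} placeholder kept as-is.
search_body = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "{{query}}"}},
            "weight": 0.8,
        }
    }
}

payload = build_run_payload(
    "Re-rank weight 0.8", "graded", ["NDCG@10", "MAP"], search_body
)
request_json = json.dumps(payload, indent=2)  # text to send as the POST body
```

Changing `weight` and the run name is then the only difference between two otherwise identical requests.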

Starting a Run

After creating a run, start it to begin query execution.

In the UI

Open the evaluation, locate the pending run, and choose Start from its actions menu. The status updates live as the run progresses through Queued → Running → Completed.

Using the API

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/start" \
-H "Authorization: Bearer ${TOKEN}"

Run Status

A run can have any of the following statuses:

Status	Description
Pending	Run has been created but not started
Queued	Start has been requested; the run will begin shortly
Running	Queries are being executed against the endpoint
Completed	All queries have been executed; their results can be judged
Locked	Judgments are frozen; the run is read-only. See Locking a run.
Cancelled	The run was stopped before completing. See Cancelling runs.
Failed	An error occurred during execution
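For client code that reacts to run status, the table can be modelled roughly as below. The transition map is an assumption inferred from the descriptions on this page, not a documented state machine; in particular, which states a run can be cancelled from is a guess. The rules that only completed runs can be locked and that comparisons need completed or locked runs come from the sections on locking and comparing.

```python
from enum import Enum

class RunStatus(Enum):
    PENDING = "Pending"
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETED = "Completed"
    LOCKED = "Locked"
    CANCELLED = "Cancelled"
    FAILED = "Failed"

S = RunStatus

# Assumed transitions (hypothetical): inferred from the status descriptions.
ALLOWED = {
    S.PENDING: {S.QUEUED},
    S.QUEUED: {S.RUNNING, S.CANCELLED},
    S.RUNNING: {S.COMPLETED, S.CANCELLED, S.FAILED},
    S.COMPLETED: {S.LOCKED},
    S.LOCKED: set(),
    S.CANCELLED: set(),
    S.FAILED: set(),
}

def can_transition(current, target):
    """True if the assumed map allows moving from current to target."""
    return target in ALLOWED[current]

def can_lock(status):
    """Only completed runs can be locked."""
    return status is S.COMPLETED

def can_compare(status):
    """Comparisons and trends use completed or locked runs."""
    return status in {S.COMPLETED, S.LOCKED}

def is_executing(status):
    """True while a poller should keep waiting for the run to finish."""
    return status in {S.QUEUED, S.RUNNING}
```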

Real-Time Progress

The UI displays progress updates live while a run is executing, so you can watch a long run advance without refreshing.

Viewing Results

Once a run completes, browse the queries table to see what came back from each query, and expand any row to inspect the candidates in detail.

In the UI

Select the completed run to see the list of queries and their results. Each query shows:

The executed URL and request body
The candidates returned by the search endpoint
The position of each candidate in the results

Using the API

List queries in a run:

curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/queries?page=1&page_size=20" \
-H "Authorization: Bearer ${TOKEN}"

List results for a specific query:

curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/queries/${QUERY_ID}/results?page=1&page_size=20" \
-H "Authorization: Bearer ${TOKEN}"
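Both listing endpoints are paginated with `page` and `page_size`. The Python sketch below shows one way to walk every page. It assumes pages are 1-indexed, as in the examples above, and that a page shorter than `page_size` marks the end; the response envelope is not shown on this page, so `fetch_page` is a hypothetical stand-in for the HTTP call that returns the list of items for one page.

```python
def iter_pages(fetch_page, page_size=20):
    """Yield every item by requesting pages until a short page is returned."""
    page = 1  # assumed 1-indexed, matching ?page=1 in the curl examples
    while True:
        items = fetch_page(page, page_size)
        yield from items
        if len(items) < page_size:
            break  # a short (or empty) page means there is nothing left
        page += 1

# Stand-in for the API: slices an in-memory list the way a paged endpoint would.
fake_queries = [f"query-{i}" for i in range(45)]

def fake_fetch(page, page_size):
    start = (page - 1) * page_size
    return fake_queries[start:start + page_size]

all_queries = list(iter_pages(fake_fetch, page_size=20))
```

In real use, `fetch_page` would issue the GET request shown above and extract the item list from whatever envelope the API returns.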

Locking a run

A locked run is read-only; its judgments cannot be added, changed, or deleted, and it will not accept new AI judging runs. Locking is how Releval preserves a "result of record" for a configuration once you've moved on to iterating.

Automatic locking

Creating a new run in an evaluation automatically locks the most recent completed run in that evaluation. The intent is that the previous run is the baseline you just decided to iterate past, so its judgments should stop changing under your feet. If you need to keep editing the prior run's judgments, do it before kicking off the new run.

Locking manually

You can lock any completed run yourself, which is useful when you want to freeze a milestone result without starting another run yet:

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/lock" \
  -H "Authorization: Bearer ${TOKEN}"

Only completed runs can be locked. Locked runs remain in metric trends and comparisons (see below); locking is about preventing judgment churn, not hiding the run.

Comparing runs

Two views let you see how relevance is moving across the runs in an evaluation: a per-evaluation trend across all runs, and a head-to-head between two runs.

Metric trends across an evaluation

The metric-trends endpoint returns each completed or locked run's metric values in chronological order, alongside a flag indicating whether the run's configuration changed from the previous one and which parts changed if so. This is what powers the "metric over time" chart in the UI, with a marker on the runs where the configuration changed. The chart plots on a real date/time axis, and a time-range picker narrows it to a rolling window (such as the last 7 days or 24 hours) or an absolute date range to focus on recent movement.

curl "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/metrics/trends" \
  -H "Authorization: Bearer ${TOKEN}"
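The "configuration changed" flag can be pictured as a comparison between each run and the one before it. The Python sketch below illustrates that idea only; the run dictionaries and their `name`, `config`, and `metrics` keys are hypothetical and do not reflect the endpoint's actual response shape.

```python
def changed_parts(previous, current):
    """Return the sorted names of configuration parts that differ between two runs."""
    keys = set(previous) | set(current)
    return sorted(k for k in keys if previous.get(k) != current.get(k))

def annotate_trend(runs):
    """Flag each run (in chronological order) whose config differs from its predecessor."""
    annotated = []
    previous = None
    for run in runs:
        parts = changed_parts(previous["config"], run["config"]) if previous else []
        annotated.append({
            "name": run["name"],
            "metrics": run["metrics"],
            "config_changed": bool(parts),
            "changed_parts": parts,
        })
        previous = run
    return annotated

# Hypothetical runs: the second one overrides only the request body.
runs = [
    {"name": "Baseline", "config": {"body": "A", "headers": {}}, "metrics": {"NDCG@10": 0.61}},
    {"name": "Weight 0.8", "config": {"body": "B", "headers": {}}, "metrics": {"NDCG@10": 0.66}},
    {"name": "Weight 0.8 rerun", "config": {"body": "B", "headers": {}}, "metrics": {"NDCG@10": 0.66}},
]
trend = annotate_trend(runs)
```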

Head-to-head comparison

The comparison is the workhorse for deciding whether a change shipped an improvement. It lines two runs up side by side — a candidate against a baseline — and reports, in a single view, how the relevance metrics moved, how much the result lists actually changed, and how execution time differed. Both runs must belong to the same evaluation and be completed or locked.

From the runs list, open a completed or locked run's actions menu and choose Compare, then pick the baseline run to measure against.

The comparison brings together three perspectives on the same pair of runs:

Relevancy metrics — run-level deltas for each metric (absolute and relative), and a per-query breakdown classifying every query as improved, regressed, or unchanged against the baseline. This is where you find the specific queries a change helped or hurt.
Similarity metrics — how much the ranked result lists changed between the runs, independent of any relevance judgments. Per query you get Rank-Biased Overlap (RBO) and Jaccard similarity, plus the churn: how many candidates the candidate run added and dropped versus the baseline. Aggregates (mean RBO, mean Jaccard, total added/dropped) summarise the whole run. Because these need no judgments, they tell you whether a "latency-only" or re-indexing change left the rankings intact, even before anyone grades a result.
Performance — execution-time deltas per query and on average. Durations are wall-clock and environment-dependent, so the delta between the two runs is the signal, not the absolute numbers.
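The quantities in this list can be sketched in a few lines of Python. The version below is illustrative only: it uses a truncated Rank-Biased Overlap without extrapolation (so two identical finite lists score `1 - p^k` rather than 1), a persistence parameter `p = 0.9`, and a small tolerance for "unchanged". Which RBO variant, `p` value, and threshold Releval uses is not stated here, so all three are assumptions, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Set overlap of two result lists, ignoring order."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def rbo_truncated(a, b, p=0.9):
    """Truncated RBO: (1 - p) * sum over depth d of p^(d-1) * overlap(d) / d."""
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, max(len(a), len(b)) + 1):
        if d <= len(a):
            seen_a.add(a[d - 1])
        if d <= len(b):
            seen_b.add(b[d - 1])
        total += p ** (d - 1) * len(seen_a & seen_b) / d
    return (1 - p) * total

def churn(baseline, candidate):
    """Candidates the candidate run added and dropped relative to the baseline."""
    base, cand = set(baseline), set(candidate)
    return {"added": sorted(cand - base), "dropped": sorted(base - cand)}

def metric_delta(baseline, candidate):
    """Absolute and relative change; relative is None when the baseline is zero."""
    absolute = candidate - baseline
    relative = absolute / baseline if baseline else None
    return absolute, relative

def classify(baseline, candidate, tolerance=1e-9):
    """Label a per-query metric change against the baseline."""
    if candidate - baseline > tolerance:
        return "improved"
    if baseline - candidate > tolerance:
        return "regressed"
    return "unchanged"

# Hypothetical result lists (document IDs in ranked order) for one query.
baseline_ids = ["a", "b", "c"]
candidate_ids = ["b", "c", "d"]
similarity = {
    "jaccard": jaccard(baseline_ids, candidate_ids),
    "rbo": rbo_truncated(baseline_ids, candidate_ids),
    "churn": churn(baseline_ids, candidate_ids),
}
```

Jaccard ignores order entirely, while RBO weights agreement near the top of the list more heavily, which is why the two are reported together.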

Clicking any query in the comparison opens a Candidate Results Diff: the baseline and candidate result lists side by side, with connectors showing how each candidate moved and each one badged with its relevance grade. It is the fastest way to see exactly what changed for a query that regressed or improved.
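The movement shown by the connectors amounts to comparing each candidate's position in the two lists. A minimal Python sketch of that idea follows; it assumes each result has a unique ID within a list, and `rank_moves` is a hypothetical helper rather than part of the Releval API.

```python
def rank_moves(baseline, candidate):
    """Describe how each result moved between the baseline and candidate lists."""
    base_pos = {doc: i for i, doc in enumerate(baseline, start=1)}
    cand_pos = {doc: i for i, doc in enumerate(candidate, start=1)}
    moves = []
    for doc in candidate:
        before, after = base_pos.get(doc), cand_pos[doc]
        if before is None:
            change = "added"
        elif after < before:
            change = "up"       # a smaller position number is a better rank
        elif after > before:
            change = "down"
        else:
            change = "same"
        moves.append({"id": doc, "from": before, "to": after, "change": change})
    # Results that appear only in the baseline were dropped by the candidate run.
    for doc in baseline:
        if doc not in cand_pos:
            moves.append({"id": doc, "from": base_pos[doc], "to": None, "change": "dropped"})
    return moves

# Hypothetical document IDs in ranked order.
moves = rank_moves(["a", "b", "c", "d"], ["b", "a", "c", "e"])
```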

curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${CANDIDATE_RUN_ID}/compare/${BASELINE_RUN_ID}?page=1&page_size=50" \
  -H "Authorization: Bearer ${TOKEN}"

See the Compare Evaluation Runs endpoint for the full response shape.

Cloning a Run

You can clone an existing run to re-execute the same queries. This is useful when you've made changes to the search endpoint and want to compare results:

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/clone" \
-H "Authorization: Bearer ${TOKEN}"

This creates a new run with the same configuration and queues it for execution.
