Running Evaluations
An Evaluation Run is a single execution of an evaluation. Each run sends every query in the query set to the search endpoint, collects the results, and prepares them for judgment.
Creating an Evaluation Run
In the UI
- Navigate to Evaluations and select an evaluation
- Click Create Run
- Enter a Name for the run
- Select the Scale for judging relevance; see Scales for details
- Select the Metrics you want calculated; see Metrics for details
- Click Create
Using the API
curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/runs" \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
"name": "Run 1 - Baseline",
"scale": "graded",
"metrics": ["NDCG@10", "MAP", "ERR@10", "MRR@10"]
}'
Scale Options
Three scales are available, each defining the granularity of relevance grades. Pick the scale that matches how nuanced your judgments need to be. See Scales for the full reference.
| Scale | Range | Use when |
|---|---|---|
| binary | | Pass/fail relevance is enough: was this result relevant or not? |
| graded | | Most evaluation work; distinguishes marginal / fair / highly / perfectly relevant. |
| detailed | | Fine-grained ranking work where small differences in relevance matter. |
Metrics
Metrics are specified by name, optionally with a cutoff depth using @k. For example, NDCG@10 computes NDCG over the top 10 results.
See Metrics for formulas and worked examples.
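To make the @k cutoff concrete, here is a minimal Python sketch of NDCG@k. It assumes linear gains (the grade itself) and derives the ideal ordering from the grades in the list passed in. Both are assumptions for illustration; Releval's exact formula is not stated here, so treat the Metrics reference as authoritative.

```python
import math
from typing import Sequence


def dcg_at_k(grades: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k grades (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades[:k], start=1))


def ndcg_at_k(grades: Sequence[float], k: int) -> float:
    """NDCG@k: DCG of the ranking divided by the DCG of the best possible ordering.

    Assumption: the ideal ordering is built from the same list of grades.
    """
    ideal = sorted(grades, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(grades, k) / idcg if idcg > 0 else 0.0


# Grades in ranked order for one query: the third result is irrelevant.
grades = [3, 2, 0, 1]
print(ndcg_at_k(grades, 2))  # the top 2 are already in ideal order
print(ndcg_at_k(grades, 4))  # a deeper cutoff exposes the misplaced result
```

The same ranking can score differently at different cutoffs, which is why the cutoff is part of the metric name.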
Overriding the template for a single run
By default, a new run inherits the query template attached to the evaluation. To compare ranking variants without forking the template, supply overrides on the create-run request: any of the request body, query string, content type, and headers can be replaced for that run only. A typical use is keeping the same endpoint and query set, but tweaking one knob between runs:
curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/runs" \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
"name": "Re-rank weight 0.8",
"scale": "graded",
"metrics": ["NDCG@10", "MAP"],
"body": "{ \"query\": { \"function_score\": { \"query\": { \"match\": { \"title\": \"{{query}}\" } }, \"weight\": 0.8 } } }"
}'
Overrides are preserved on the run itself, so the comparison between runs can flag exactly which parts of the configuration changed.
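The body override in the request above is a JSON document embedded as a string, which is easy to get wrong when escaping by hand. A minimal Python sketch of one way to build it follows. The field names (name, scale, metrics, body) come from the example above; the helper name build_run_payload is hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def build_run_payload(
    name: str,
    scale: str,
    metrics: List[str],
    body_template: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a create-run payload; the body override is serialised to a JSON string."""
    payload: Dict[str, Any] = {"name": name, "scale": scale, "metrics": metrics}
    if body_template is not None:
        # json.dumps handles the quote escaping when the payload is serialised again.
        payload["body"] = json.dumps(body_template)
    return payload


template = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "{{query}}"}},
            "weight": 0.8,
        }
    }
}
payload = build_run_payload("Re-rank weight 0.8", "graded", ["NDCG@10", "MAP"], template)

# This string is what would be passed to curl's -d option.
print(json.dumps(payload, indent=2))
```

Keeping the template as a dictionary means only the value of the knob under test changes between runs, and the escaping is never edited by hand.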
Starting a Run
After creating a run, start it to begin query execution.
In the UI
Open the evaluation, locate the pending run, and choose Start from its actions menu. The status updates live as the run progresses through Queued → Running → Completed.
Using the API
curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/start" \
-H "Authorization: Bearer ${TOKEN}"
Run Status
Each run progresses through the following statuses:
| Status | Description |
|---|---|
| Pending | Run has been created but not started |
| Queued | Start has been requested; the run will begin shortly |
| Running | Queries are being executed against the endpoint |
| Completed | All queries have been executed; the results are ready to be judged |
| Locked | Judgments are frozen; the run is read-only. See Locking a run. |
| Cancelled | The run was stopped before completing. See Cancelling runs. |
| Failed | An error occurred during execution |
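For scripts that start a run and then wait for it, the statuses above can be treated as a small state machine. The sketch below polls until the run leaves the in-progress statuses. It assumes the status strings match the table exactly, and fetch_status is a hypothetical callable you supply, since the endpoint for reading a run's status is not shown on this page.

```python
import time
from typing import Callable

# Statuses after which query execution will not progress any further.
FINISHED = {"Completed", "Locked", "Cancelled", "Failed"}


def wait_for_run(
    fetch_status: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until the run reaches a finished status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in FINISHED:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout} seconds")
        sleep(interval)


# Usage with a stand-in for the real status lookup.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
print(wait_for_run(lambda: next(fake_statuses), interval=0.0))
```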
Real-Time Progress
The UI displays progress updates live while a run is executing, so you can watch a long run advance without refreshing.
Viewing Results
Once a run completes, browse the queries table to see what came back from each query, and expand any row to inspect the candidates in detail.
In the UI
Select the completed run to see the list of queries and their results. Each query shows:
- The executed URL and request body
- The candidates returned by the search endpoint
- The position of each candidate in the results
Using the API
List queries in a run:
curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/queries?page=1&page_size=20" \
-H "Authorization: Bearer ${TOKEN}"
List results for a specific query:
curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/queries/${QUERY_ID}/results?page=1&page_size=20" \
-H "Authorization: Bearer ${TOKEN}"
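Both list endpoints above are paginated with page and page_size. A minimal Python sketch of walking every page follows. It assumes pages are 1-indexed (as in the examples) and that a page shorter than page_size marks the end. The fetch_page callable is hypothetical and stands in for the HTTP call, because the response shape is not documented on this page.

```python
from typing import Any, Callable, Iterator, List


def iter_items(
    fetch_page: Callable[[int, int], List[Any]],
    page_size: int = 20,
) -> Iterator[Any]:
    """Yield every item across pages, stopping at the first short page."""
    page = 1
    while True:
        items = fetch_page(page, page_size)
        yield from items
        if len(items) < page_size:
            return
        page += 1


# Usage with an in-memory stand-in for the queries endpoint.
all_queries = [f"query-{n}" for n in range(45)]


def fake_fetch(page: int, page_size: int) -> List[str]:
    start = (page - 1) * page_size
    return all_queries[start:start + page_size]


print(len(list(iter_items(fake_fetch, page_size=20))))
```

If the real response carries a total count or a next-page marker, prefer that over the short-page heuristic.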
Locking a run
A locked run is read-only; its judgments cannot be added, changed, or deleted, and it will not accept new AI judging runs. Locking is how Releval preserves a "result of record" for a configuration once you've moved on to iterating.
Automatic locking
Creating a new run in an evaluation automatically locks the most recent completed run in that evaluation. The intent is that the previous run is the baseline you just decided to iterate past, so its judgments should stop changing under your feet. If you need to keep editing the prior run's judgments, do it before kicking off the new run.
Locking manually
You can lock any completed run yourself, which is useful when you want to freeze a milestone result without starting another run yet:
curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/lock" \
-H "Authorization: Bearer ${TOKEN}"
Only completed runs can be locked. Locked runs remain in metric trends and comparisons (see below); locking is about preventing judgment churn, not hiding the run.
Comparing runs
Two views let you see how relevance is moving across the runs in an evaluation: a per-evaluation trend across all runs, and a head-to-head between two runs.
Metric trends across an evaluation
The metric-trends endpoint returns each completed or locked run's metric values in chronological order, alongside a flag indicating whether the run's configuration changed from the previous one and which parts changed if so. This is what powers the "metric over time" chart in the UI, with a marker on the runs where the configuration changed. The chart plots on a real date/time axis, and a time-range picker narrows it to a rolling window (such as the last 7 days or 24 hours) or an absolute date range to focus on recent movement.
curl "https://${RELEVAL_HOST}/api/v1/evaluations/${EVALUATION_ID}/metrics/trends" \
-H "Authorization: Bearer ${TOKEN}"
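The configuration-changed flag described above can be thought of as a diff between each run and the one before it. The following Python sketch illustrates that idea on plain dictionaries. The key names (body, query_string, content_type, headers) mirror the overridable parts listed earlier but are assumptions, as is the shape of each run record; the actual trends response may differ.

```python
from typing import Any, Dict, List

# The parts of the query template that a run can override (key names are assumed).
OVERRIDABLE_PARTS = ("body", "query_string", "content_type", "headers")


def changed_parts(previous: Dict[str, Any], current: Dict[str, Any]) -> List[str]:
    """Return the names of the configuration parts that differ between two runs."""
    return [part for part in OVERRIDABLE_PARTS if previous.get(part) != current.get(part)]


def annotate_trend(runs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Walk runs in chronological order and flag where the configuration changed."""
    annotated = []
    previous = None
    for run in runs:
        parts = changed_parts(previous, run["config"]) if previous is not None else []
        annotated.append(
            {"name": run["name"], "config_changed": bool(parts), "changed_parts": parts}
        )
        previous = run["config"]
    return annotated


runs = [
    {"name": "Baseline", "config": {"body": '{"weight": 0.5}', "headers": {}}},
    {"name": "Rerun", "config": {"body": '{"weight": 0.5}', "headers": {}}},
    {"name": "Re-rank weight 0.8", "config": {"body": '{"weight": 0.8}', "headers": {}}},
]
for row in annotate_trend(runs):
    print(row)
```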
Head-to-head comparison
The comparison is the workhorse for deciding whether a change actually improved relevance. It lines two runs up side by side, a candidate against a baseline, and reports in a single view how the relevance metrics moved, how much the result lists changed, and how execution time differed. Both runs must belong to the same evaluation and be completed or locked.
From the runs list, open a completed or locked run's actions menu and choose Compare, then pick the baseline run to measure against.
The comparison brings together three perspectives on the same pair of runs:
- Relevancy metrics — run-level deltas for each metric (absolute and relative), and a per-query breakdown classifying every query as improved, regressed, or unchanged against the baseline. This is where you find the specific queries a change helped or hurt.
- Similarity metrics — how much the ranked result lists changed between the runs, independent of any relevance judgments. Per query you get Rank-Biased Overlap (RBO) and Jaccard similarity, plus the churn: how many candidates the candidate run added and dropped versus the baseline. Aggregates (mean RBO, mean Jaccard, total added/dropped) summarise the whole run. Because these need no judgments, they tell you whether a "latency-only" or re-indexing change left the rankings intact, even before anyone grades a result.
- Performance — execution-time deltas per query and on average. Durations are wall-clock and environment-dependent, so the delta between the two runs is the signal, not the absolute numbers.
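The similarity metrics in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Releval's implementation: result lists are assumed to contain no duplicates, and the RBO shown is the plain truncated sum with persistence p = 0.9, without extrapolation, so two identical lists of length k score 1 - p^k rather than 1.

```python
from typing import Hashable, List, Sequence, Tuple


def jaccard(baseline: Sequence[Hashable], candidate: Sequence[Hashable]) -> float:
    """Set overlap of two result lists, ignoring order."""
    a, b = set(baseline), set(candidate)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def churn(baseline: Sequence[Hashable], candidate: Sequence[Hashable]) -> Tuple[List, List]:
    """Documents the candidate run added and dropped relative to the baseline."""
    base_set, cand_set = set(baseline), set(candidate)
    added = [doc for doc in candidate if doc not in base_set]
    dropped = [doc for doc in baseline if doc not in cand_set]
    return added, dropped


def rbo_truncated(baseline: Sequence[Hashable], candidate: Sequence[Hashable], p: float = 0.9) -> float:
    """Truncated Rank-Biased Overlap: top-weighted agreement down to the shorter list's depth."""
    depth = min(len(baseline), len(candidate))
    seen_base, seen_cand = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_base.add(baseline[d - 1])
        seen_cand.add(candidate[d - 1])
        agreement = len(seen_base & seen_cand) / d  # overlap of the two top-d prefixes
        total += p ** (d - 1) * agreement
    return (1 - p) * total


baseline = ["a", "b", "c", "d"]
candidate = ["b", "a", "c", "e"]
print(jaccard(baseline, candidate))        # 3 shared of 5 distinct documents
print(churn(baseline, candidate))
print(rbo_truncated(baseline, candidate))
```

Jaccard ignores order entirely, while RBO weights agreement near the top of the list more heavily, so the two can disagree when a change only reshuffles the same documents.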
Clicking any query in the comparison opens a Candidate Results Diff: the baseline and candidate result lists side by side, with connectors showing how each candidate moved and each one badged with its relevance grade. It is the fastest way to see exactly what changed for a query that regressed or improved.
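The movement shown by the connectors in that diff can be approximated from the two ranked lists alone. A minimal Python sketch follows, assuming each candidate has a stable identifier and appears at most once per list; the function name and the labels (up, down, same, added, dropped) are illustrative and are not taken from the product.

```python
from typing import Hashable, List, Optional, Sequence, Tuple

Move = Tuple[Hashable, Optional[int], Optional[int], str]


def rank_moves(baseline: Sequence[Hashable], candidate: Sequence[Hashable]) -> List[Move]:
    """Describe how each document moved between two ranked lists (positions are 1-based)."""
    base_pos = {doc: i for i, doc in enumerate(baseline, start=1)}
    cand_pos = {doc: i for i, doc in enumerate(candidate, start=1)}
    moves: List[Move] = []
    for doc in candidate:
        old, new = base_pos.get(doc), cand_pos[doc]
        if old is None:
            kind = "added"
        elif new < old:
            kind = "up"
        elif new > old:
            kind = "down"
        else:
            kind = "same"
        moves.append((doc, old, new, kind))
    # Documents present only in the baseline were dropped by the candidate run.
    for doc in baseline:
        if doc not in cand_pos:
            moves.append((doc, base_pos[doc], None, "dropped"))
    return moves


for move in rank_moves(["a", "b", "c", "d"], ["b", "a", "c", "e"]):
    print(move)
```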
curl "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${CANDIDATE_RUN_ID}/compare/${BASELINE_RUN_ID}?page=1&page_size=50" \
-H "Authorization: Bearer ${TOKEN}"
See the Compare Evaluation Runs endpoint for the full response shape.
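The run-level deltas and the per-query classification described under Relevancy metrics come down to simple arithmetic. The Python sketch below shows one way to compute them, assuming higher metric values are better and using a small tolerance so floating-point noise does not count as a change. The tolerance value and function names are assumptions, not part of the API.

```python
from typing import Optional, Tuple


def metric_delta(baseline: float, candidate: float) -> Tuple[float, Optional[float]]:
    """Absolute and relative change of a metric; relative is None when the baseline is 0."""
    absolute = candidate - baseline
    relative = absolute / baseline if baseline != 0 else None
    return absolute, relative


def classify(baseline: float, candidate: float, tolerance: float = 1e-9) -> str:
    """Label a query as improved, regressed, or unchanged against the baseline."""
    diff = candidate - baseline
    if diff > tolerance:
        return "improved"
    if diff < -tolerance:
        return "regressed"
    return "unchanged"


# Per-query NDCG@10 values for a baseline and a candidate run (made-up numbers).
per_query = {"q1": (0.50, 0.60), "q2": (0.80, 0.70), "q3": (0.90, 0.90)}
for query_id, (base, cand) in per_query.items():
    print(query_id, classify(base, cand), metric_delta(base, cand))
```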
Cloning a Run
You can clone an existing run to re-execute the same queries. This is useful when you've made changes to the search endpoint and want to compare results:
curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations/runs/${RUN_ID}/clone" \
-H "Authorization: Bearer ${TOKEN}"
This creates a new run with the same configuration and queues it for execution.