Evaluations

An Evaluation ties together the core building blocks of a search relevance assessment:

A Search Endpoint: the search system to query
A Query Set: the queries to run
A Query Template: how queries are formatted into requests

Once you create an evaluation, you can run it repeatedly to track how your search relevance changes over time.

Creating an Evaluation

In the UI

Navigate to Evaluations and click Create Evaluation
Enter a Name and optional Description
Select the Endpoint, Query Set, and Query Template to use
Click Create

Using the API

curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations" \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
--data @- <<EOF
{
  "name": "Product Search v2",
  "description": "Evaluate new ranking model",
  "endpoint_id": "${ENDPOINT_ID}",
  "query_set_id": "${QUERY_SET_ID}",
  "query_template_id": "${QUERY_TEMPLATE_ID}"
}
EOF
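
Because the heredoc above is unquoted, the shell expands each variable before sending the request, and an unset variable silently becomes an empty string in the JSON body. A minimal pre-flight check can catch that early. `require_vars` is a hypothetical helper written for this illustration, not part of Releval; it relies only on the variable names used in the request above.

```shell
# Hypothetical helper (not part of Releval): report every named
# variable that is unset or empty, and return non-zero if any is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup that also works in plain POSIX sh.
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `require_vars RELEVAL_HOST TOKEN ENDPOINT_ID QUERY_SET_ID QUERY_TEMPLATE_ID || exit 1` before the curl command stops the script before a request with empty IDs is sent.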

Evaluation Workflow

The typical workflow for evaluating search relevance:

Create an Evaluation: define what you're evaluating
Create an Evaluation Run: choose the relevance scale and metrics
Start the run: Releval executes each query against your endpoint and collects results
Judge the results: rate how relevant each returned candidate is
Review metrics: Releval automatically calculates metrics from your judgments
Iterate: create new runs to track improvements as you adjust your search configuration
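
To make the "judge, then review metrics" steps concrete, here is a sketch of how one widely used metric, nDCG, can be derived from graded judgments. This page does not say which metrics Releval offers or how it computes them, so treat this as a general illustration only. The `ndcg` function name and the 0 to 3 rating scale are assumptions made for the example.

```shell
# Illustrative only: computes nDCG for one query from its judgments.
# Usage: ndcg "3 2 0 1"   (ratings listed in the order results were returned)
ndcg() {
  awk -v ratings="$1" '
    # DCG with linear gain: sum of rel_i / log2(i + 1)
    function dcg(r, n,    i, s) {
      s = 0
      for (i = 1; i <= n; i++) s += r[i] / (log(i + 1) / log(2))
      return s
    }
    BEGIN {
      n = split(ratings, rel, " ")
      for (i = 1; i <= n; i++) ideal[i] = rel[i]
      # Insertion sort, descending, to build the ideal ordering.
      for (i = 2; i <= n; i++) {
        v = ideal[i]
        for (j = i - 1; j >= 1 && ideal[j] + 0 < v + 0; j--) ideal[j + 1] = ideal[j]
        ideal[j + 1] = v
      }
      best = dcg(ideal, n)
      if (best == 0) { print "0.0000"; exit }   # no relevant results judged
      printf "%.4f\n", dcg(rel, n) / best
    }'
}
```

A result list already in ideal order, such as `ndcg "3 2 1 0"`, scores 1.0000. Placing the only relevant result second, as in `ndcg "0 1"`, scores 0.6309.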

Comparing runs

Because each evaluation can have multiple runs, you can track how metrics evolve:

Run before and after changing a ranking model
Run against different endpoints (production vs. staging)
Run with different query templates to compare search strategies

Each run preserves its results and metrics independently, giving you a historical record of search quality. To dig into a specific change, compare two runs head-to-head. The comparison reports per-query deltas in relevance metrics, how much the result lists changed (rank-biased overlap (RBO), Jaccard similarity, and candidate churn, all computed independently of judgments), and execution-time differences. It also provides a side-by-side diff of the candidates each run returned.
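
The list-change measures need only the two ranked lists of candidate IDs, with no judgments. The sketch below shows the textbook definitions of Jaccard similarity and a truncated rank-biased overlap. Releval's exact variants and parameters are not documented on this page, so the function names, the persistence value `p`, and the truncation at the shorter list's depth are all assumptions.

```shell
# Illustrative only. Lists are whitespace-separated candidate IDs in rank order.

# Jaccard similarity: |A ∩ B| / |A ∪ B| (ignores rank).
jaccard() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, A, " "); nb = split(b, B, " ")
    for (i = 1; i <= na; i++) inA[A[i]] = 1
    for (i = 1; i <= nb; i++) inB[B[i]] = 1
    inter = 0; total = 0
    for (id in inA) { total++; if (id in inB) inter++ }
    for (id in inB) if (!(id in inA)) total++
    if (total == 0) { print "1.0000"; exit }   # two empty lists: treated as identical
    printf "%.4f\n", inter / total
  }'
}

# Truncated RBO: (1 - p) * sum over depth d of p^(d-1) * overlap(d) / d,
# summed to the depth of the shorter list. Assumes no duplicate IDs in a list.
rbo() {
  awk -v a="$1" -v b="$2" -v p="${3:-0.9}" 'BEGIN {
    na = split(a, A, " "); nb = split(b, B, " ")
    k = (na < nb) ? na : nb
    overlap = 0; score = 0; w = 1 - p
    for (d = 1; d <= k; d++) {
      # Update the size of the intersection of the two depth-d prefixes.
      if (A[d] == B[d]) overlap++
      else {
        if (A[d] in seenB) overlap++
        if (B[d] in seenA) overlap++
      }
      seenA[A[d]] = 1; seenB[B[d]] = 1
      score += w * overlap / d
      w *= p
    }
    printf "%.4f\n", score
  }'
}
```

For example, `jaccard "d1 d2 d3 d4" "d3 d4 d5 d6"` prints 0.3333, since two of six distinct candidates are shared. Because the RBO sum is truncated, it is a lower bound: two identical lists of depth k score 1 - p^k rather than 1, so `rbo "d1 d2 d3" "d1 d2 d3" 0.9` prints 0.2710.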
