Evaluations
An Evaluation ties together the core building blocks of a search relevance assessment:
- A Search Endpoint: the search system to query
- A Query Set: the queries to run
- A Query Template: how queries are formatted into requests
Once you create an evaluation, you can run it repeatedly to track how your search relevance changes over time.
Creating an Evaluation
In the UI
- Navigate to Evaluations and click Create Evaluation
- Enter a Name and optional Description
- Select the Endpoint, Query Set, and Query Template to use
- Click Create
Using the API
curl -X POST "https://${RELEVAL_HOST}/api/v1/evaluations" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${TOKEN}" \
  --data @- <<EOF
{
  "name": "Product Search v2",
  "description": "Evaluate new ranking model",
  "endpoint_id": "${ENDPOINT_ID}",
  "query_set_id": "${QUERY_SET_ID}",
  "query_template_id": "${QUERY_TEMPLATE_ID}"
}
EOF
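If you script evaluation creation rather than typing curl by hand, the same request can be assembled in Python. The sketch below builds the request from the curl example using only the standard library, and serializes the body with `json.dumps` so that names or descriptions containing quotes stay valid JSON. The function name `build_create_evaluation_request` is hypothetical, and treating `description` as omittable in the API (as it is in the UI) is an assumption.

```python
import json
import urllib.request


def build_create_evaluation_request(host, token, name, endpoint_id,
                                    query_set_id, query_template_id,
                                    description=None):
    """Build (but do not send) the POST request from the curl example.

    Hypothetical helper: only the URL path, headers, and field names
    come from the documented curl call.
    """
    payload = {
        "name": name,
        "endpoint_id": endpoint_id,
        "query_set_id": query_set_id,
        "query_template_id": query_template_id,
    }
    # Assumption: the API accepts a request without "description",
    # mirroring the optional Description field in the UI.
    if description is not None:
        payload["description"] = description

    return urllib.request.Request(
        url=f"https://{host}/api/v1/evaluations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The shape of the response is not described here, so the sketch stops before parsing one.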
Evaluation Workflow
The typical workflow for evaluating search relevance:
- Create an Evaluation: define what you're evaluating
- Create an Evaluation Run: choose the relevance scale and metrics
- Start the run: Releval executes each query against your endpoint and collects results
- Judge the results: rate how relevant each returned candidate is
- Review metrics: Releval automatically calculates metrics from your judgments
- Iterate: create new runs to track improvements as you adjust your search configuration
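To make the "Review metrics" step concrete, here is a minimal sketch of how two common relevance metrics can be derived from graded judgments. This page does not say which metrics or formulas Releval uses, so precision@k and nDCG@k are illustrative choices, and the function names and the `relevant_from` threshold are hypothetical. Each input is the list of grades for one query's results, in ranked order.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the top-k grades, in ranked order."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))


def ndcg_at_k(grades, k):
    """DCG divided by the DCG of the same grades sorted best-first.

    Simplification: the ideal ordering is built only from the grades of
    the returned results, not from every judged document for the query.
    """
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


def precision_at_k(grades, k, relevant_from=1):
    """Fraction of the top-k slots graded at or above `relevant_from`."""
    if k <= 0:
        return 0.0
    return sum(1 for g in grades[:k] if g >= relevant_from) / k


# Grades on an assumed 0-3 scale for one query's top three results.
grades = [3, 0, 2]
print(ndcg_at_k(grades, 3), precision_at_k(grades, 3))
```

Averaging such per-query values across the query set gives a single number per run, which is the kind of figure you would compare between runs.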
Comparing runs
Because each evaluation can have multiple runs, you can track how metrics evolve:
- Run before and after changing a ranking model
- Run against different endpoints (production vs. staging)
- Run with different query templates to compare search strategies
Each run preserves its results and metrics independently, giving you a historical record of search quality. To dig into a specific change, compare two runs head-to-head. The comparison reports per-query deltas in relevance metrics, execution-time differences, and how much the result lists changed, measured by rank-biased overlap (RBO), Jaccard similarity, and candidate churn. These list-change measures are computed from the returned candidates alone, so they do not depend on judgments. For the finest level of detail, the comparison also shows a side-by-side diff of the candidates each run returned.
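The list-change measures need only the two ranked candidate lists for a query. As a rough illustration, the sketch below computes Jaccard similarity and a truncated form of RBO. It is not Releval's implementation: the function names and the persistence parameter `p=0.9` are assumptions, and candidate churn is left out because its definition is not given here.

```python
def jaccard(a, b):
    """Set overlap of two candidate lists, ignoring rank order."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty result lists are treated as identical
    return len(sa & sb) / len(sa | sb)


def rbo_truncated(a, b, p=0.9):
    """Rank-biased overlap summed to the depth of the shorter list.

    At each depth d, the agreement is the share of items the two
    prefixes have in common, weighted by p**(d - 1) so that top ranks
    count most. This truncated sum is not extrapolated, so identical
    lists of length k score 1 - p**k rather than exactly 1.
    Assumes no duplicate candidates within a list.
    """
    depth = min(len(a), len(b))
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        total += p ** (d - 1) * len(seen_a & seen_b) / d
    return (1 - p) * total


# Hypothetical candidate IDs returned for one query by two runs.
run_a = ["doc1", "doc2", "doc3", "doc4"]
run_b = ["doc2", "doc1", "doc5", "doc3"]
print(jaccard(run_a, run_b), rbo_truncated(run_a, run_b))
```

Jaccard reacts only to which candidates appear, while RBO also penalizes reordering near the top, which is why the two are useful to read together.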