Corpora
A Corpus represents a document collection that one or more search endpoints query against, for example a product catalogue, a knowledge base, or a movie database. The corpus is the underlying data; endpoints are the different ways your search infrastructure exposes that data.
A corpus has a name, an optional description, and a match mode that fixes how its queries line up with relevance judgments (see Match mode). Its value comes from what it groups together: the endpoints that query the same data and the judgments made against that data.
Why corpora matter
The most important effect of a corpus is judgment portability. Judgments and judgment lists are scoped per corpus, not per endpoint, which means:
- A judgment of "Air Zoom Pegasus is highly relevant for 'running shoes'" can be reused across every endpoint that points at the same product corpus.
- When you spin up a new endpoint to test a different ranking model on the same data, the new evaluation run automatically benefits from existing judgments without re-judging the same candidates.
- Imported judgment lists (from logs, prior campaigns, or expert annotation) attach to a corpus and are immediately available to every endpoint sharing it.
A judgment is uniquely keyed by (corpus, query, candidate), so the same query and candidate
in a different corpus is treated as a separate judgment.
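The keying rule can be pictured with a small shell sketch. It is only an illustration of the (corpus, query, candidate) key, not how judgments are actually stored; the helper names `add_judgment` and `get_judgment`, the document ID `sku-123`, and the grade `3` are all hypothetical. The sketch also does not enforce uniqueness: a lookup simply returns the first matching line.

```shell
# Illustrative store: one judgment per line, tab-separated as
# corpus <TAB> query <TAB> candidate <TAB> grade. All names are hypothetical.
JUDGMENTS=""

add_judgment() {
  # $1=corpus  $2=query text  $3=candidate (document ID)  $4=grade
  JUDGMENTS="${JUDGMENTS}$(printf '%s\t%s\t%s\t%s' "$1" "$2" "$3" "$4")
"
}

get_judgment() {
  # Prints the grade for an exact (corpus, query, candidate) match, or nothing.
  printf '%s' "$JUDGMENTS" | awk -F '\t' -v c="$1" -v q="$2" -v d="$3" \
    '$1 == c && $2 == q && $3 == d { print $4; exit }'
}

add_judgment "products-uk" "running shoes" "sku-123" 3

get_judgment "products-uk" "running shoes" "sku-123"   # same corpus: prints the grade
get_judgment "products-us" "running shoes" "sku-123"   # different corpus: prints nothing
```

Because the corpus is part of the key, the second lookup finds nothing even though the query and candidate are identical.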
Match mode
A corpus's match mode decides what "the same query" means when judgments are matched to the queries in an evaluation run. It is chosen once when the corpus is created and cannot be changed afterwards, since changing it would silently repoint every existing judgment.
- Query text (qid): the default. A judgment applies to every query with the same text, whatever request body it was wrapped in. Judge "running shoes" once and that grade counts for any run whose query text is "running shoes", even if two runs send structurally different requests.
- Full query object (oid): a judgment applies only to the exact query object it was made against, so two runs that share the same text but send different request bodies keep their judgments apart. Use this when the wrapping request changes what "relevant" means for the same text.
Query text is matched exactly, including case and whitespace, so "Red Shoes" and "red shoes" count
as different queries. The match mode applies to every judgment on the corpus: interactive
judgments, AI judgments, and uploaded judgment lists alike.
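The difference between the two modes can be sketched as a choice of comparison value. The helper `match_key` and the two request bodies below are hypothetical, and treating the full query object as the verbatim request body is an assumption made for illustration; the product may compare query objects differently.

```shell
# match_key MODE QUERY_TEXT REQUEST_BODY
# Prints the value a judgment would be matched on under each mode.
# Hypothetical helper: "oid" is modelled here as the verbatim request body.
match_key() {
  case "$1" in
    qid) printf '%s\n' "$2" ;;   # query text only, compared exactly
    oid) printf '%s\n' "$3" ;;   # the whole query object
    *)   echo "unknown match mode: $1" >&2; return 1 ;;
  esac
}

# Two made-up requests that share the same query text.
body_a='{"query":"running shoes","size":10}'
body_b='{"query":"running shoes","size":10,"filter":{"brand":"nike"}}'

# Under qid both requests map to the same key, so they share judgments.
[ "$(match_key qid "running shoes" "$body_a")" = "$(match_key qid "running shoes" "$body_b")" ] \
  && echo "qid: judgments shared"

# Under oid the keys differ, so each request keeps its own judgments.
[ "$(match_key oid "running shoes" "$body_a")" != "$(match_key oid "running shoes" "$body_b")" ] \
  && echo "oid: judgments kept apart"
```

Under either mode the comparison is exact, which is why a change in case or whitespace produces a different key.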
Modelling your corpora
A common pattern: one corpus per logical document collection, multiple endpoints per corpus. For example:
| Corpus | Endpoints sharing it |
|---|---|
| products-uk | prod-elasticsearch, prod-elasticsearch-bm25-tuned, prod-rerank-v2 |
| support-articles | support-elasticsearch, support-vector |
| movies-demo | movies-elasticsearch, movies-opensearch |
Use a separate corpus when:
- The underlying documents differ (a product catalogue vs. a knowledge base).
- The document IDs aren't compatible across systems (judgments would mismatch).
- You want judgments isolated for compliance or experimentation reasons.
Reuse an existing corpus when you're trying different search configurations against the same data.
Creating a corpus
In the UI
- Navigate to Corpora from the sidebar and click Create.
- Enter a Name and optional Description.
- Choose a Match mode (defaults to query text); see Match mode. It can't be changed later.
- Click Create.
You can then assign endpoints to it from the endpoint creation or edit page.
Using the API
curl -X POST "https://${RELEVAL_HOST}/api/v1/corpora" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "name": "products-uk",
    "description": "UK product catalogue, used by all production search variants.",
    "match_mode": "qid"
  }'
match_mode is optional and defaults to qid (match by query text); see
Match mode. The response includes the new corpus ID, which you supply as
corpus_id when creating an endpoint.
Managing corpora
List corpora
curl "https://${RELEVAL_HOST}/api/v1/corpora" \
  -H "Authorization: Bearer ${TOKEN}"
Each entry includes an endpoint_count so you can see how many endpoints are attached to it.
Update a corpus
curl -X PUT "https://${RELEVAL_HOST}/api/v1/corpora/${CORPUS_ID}" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "name": "products-uk",
    "description": "UK product catalogue, including refurbished items."
  }'
Delete a corpus
curl -X DELETE "https://${RELEVAL_HOST}/api/v1/corpora?corpus_id=${CORPUS_ID}" \
  -H "Authorization: Bearer ${TOKEN}"
A corpus that still has endpoints attached cannot be deleted; the API returns
409 Conflict. Delete or reassign the endpoints first, then delete the corpus.
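In a script it can help to tell the 409 case apart from other failures. The following is a minimal sketch that wraps the DELETE request above; the function name `delete_corpus` and its messages are hypothetical, and treating any 2xx status as success is an assumption, since the exact success code is not stated here.

```shell
# delete_corpus CORPUS_ID
# Sends the DELETE request shown above and interprets the HTTP status.
# Hypothetical helper; expects RELEVAL_HOST and TOKEN to be set.
delete_corpus() {
  # -s silences progress, -o discards the body, -w prints only the status code.
  status=$(curl -s -o /dev/null -w '%{http_code}' -X DELETE \
    "https://${RELEVAL_HOST}/api/v1/corpora?corpus_id=$1" \
    -H "Authorization: Bearer ${TOKEN}")
  case "$status" in
    2??) echo "deleted corpus $1" ;;   # assumption: any 2xx means success
    409) echo "corpus $1 still has endpoints attached; delete or reassign them first" >&2
         return 1 ;;
    *)   echo "unexpected HTTP status $status while deleting corpus $1" >&2
         return 1 ;;
  esac
}
```

A call such as `delete_corpus "${CORPUS_ID}" || exit 1` then stops a cleanup script before it continues with a corpus that still exists.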