For LLM evals, retrieval comes first
Why embedding retrieval evals should be the foundation of your RAG evaluation strategy
The eval pyramid
Most teams building RAG applications get evals backwards. They reach for LLM-as-judge, human review panels, or multi-model consensus scoring before they have a foundation to build on. These techniques are slow, expensive, and subjective, and shouldn’t be the base of our testing strategy.
We can use the QA testing pyramid as an analogy. A healthy test suite has a broad base of fast unit tests, a middle layer of integration tests, and a narrow top of end-to-end tests. Flip that pyramid and we get a slow, flaky, expensive suite that nobody trusts.
Eval strategies follow the same shape. The base should be embedding retrieval evals: fast, cheap, and objective.
The problem with starting at the top
Take an end-to-end eval: “What caused the French Revolution?” An LLM can answer this dozens of different ways, and the phrasing changes between runs. Grading correctness requires another LLM (or a human), and both introduce subjectivity. We end up debating whether “inequality and taxation” and “the collapse of the ancien régime” mean the same thing. That is an expensive, nondeterministic foundation to build on.
Beyond the subjectivity, LLM-as-judge evals are expensive to run. Each eval sends the question, the retrieved context, the generated answer, and a grading rubric to a model. That is thousands of input tokens per eval. A 500-eval suite can take tens of minutes and cost multiple dollars per run depending on the judge model, with those numbers multiplying with every retry from nondeterministic grading.
Retrieval evals: the base of the pyramid
Let’s start one layer lower. Before asking whether the LLM answered correctly, ask whether it found the right information.
A retrieval eval looks like this: “What page can information on the French Revolution be found on?”
The answer is a specific page number, document ID, or chunk reference. It is deterministic and binary: the system either retrieved the correct source or it didn’t, so no LLM judge is required.
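The binary check described above can be sketched in a few lines. The function names here (`eval_retrieval`, `toy_search`) and the chunk IDs are hypothetical stand-ins for whatever retrieval interface the RAG system actually exposes:

```python
# A retrieval eval as a deterministic assertion: pass if the expected
# chunk ID appears in the top-k results, fail otherwise. No LLM judge.

def eval_retrieval(search, query, expected_chunk_id, k=5):
    """Return True if `expected_chunk_id` is in the top-k results for `query`."""
    return expected_chunk_id in search(query, k=k)

# Toy stand-in retriever so the sketch is runnable; a real system would
# query its vector index here.
def toy_search(query, k=5):
    index = {"What caused the French Revolution?": ["hist-042", "hist-007", "econ-013"]}
    return index.get(query, [])[:k]

assert eval_retrieval(toy_search, "What caused the French Revolution?", "hist-042")
```

Because the result is a plain boolean, the eval can run as an ordinary test assertion with no grading step in between.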
This matters because retrieval is the prerequisite for generation quality. If our system pulls the wrong documents into the context window, the model generates answers from the wrong source material.
Why the pyramid base matters
Retrieval evals give us the fastest feedback loop. When a retrieval eval fails, the fix is concrete: adjust our chunking strategy, tune our embedding model, update our metadata filters, etc. The debugging surface is small and well-defined.
When an LLM-as-judge eval fails, the debugging surface is enormous. Was it a retrieval problem? A prompt problem? A model behavior change? A judge calibration issue? We can’t tell without digging through multiple layers.
And that’s before we even get to cost. A retrieval eval is a cosine similarity check against a vector index; it costs fractions of a cent per query and takes milliseconds to run. The same 500-eval suite that costs multiple dollars per run with LLM-as-judge runs for pennies in a few seconds with retrieval evals. A retrieval suite fits into CI the same way unit tests do, whereas an LLM-as-judge suite becomes a costly bottleneck.
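For illustration, here is the similarity check at the heart of that lookup, written out in plain Python. The two-dimensional vectors are toy values; in practice both query and chunk vectors would come from the embedding model, and the ranking would be done by the vector index itself:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, chunk_vecs, k=5):
    """Rank chunk IDs by cosine similarity to the query embedding."""
    ranked = sorted(chunk_vecs, key=lambda cid: cosine_similarity(query_vec, chunk_vecs[cid]), reverse=True)
    return ranked[:k]

# Toy 2-d "embeddings" standing in for real model output.
chunks = {"hist-042": [0.9, 0.1], "econ-013": [0.1, 0.9]}
assert top_k([1.0, 0.0], chunks, k=1) == ["hist-042"]
```

This is the whole runtime cost of a retrieval eval: a handful of float operations per chunk, which is why the suite runs in seconds rather than minutes.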
Starting at the base also gives our LLM the best possible chance of providing value. Get the right information into the context window first, then worry about whether the model phrases its answer well.
A practical starting point
Pick 50 questions users actually ask. For each one, identify the document or chunk that contains the answer. Write a retrieval eval that asserts the correct chunk appears in the top-k results.
Now we can run this suite on every change to our embedding model, chunking strategy, or ingestion pipeline. It takes seconds, costs almost nothing, and catches the failures that matter most.
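A suite like this can be a single function that reports recall@k and the failing cases. Everything here is a hypothetical sketch: `run_suite`, `toy_search`, and the case data are placeholders for the real question set and retrieval function:

```python
# Run every (question -> expected chunk ID) case and report recall@k
# plus the concrete failures to debug.

def run_suite(search, cases, k=5):
    failures = [(q, cid) for q, cid in cases.items() if cid not in search(q, k=k)]
    recall_at_k = 1 - len(failures) / len(cases)
    return recall_at_k, failures

# Stand-in retriever so the sketch runs end to end.
def toy_search(query, k=5):
    return ["hist-042", "econ-013"][:k]

cases = {
    "What caused the French Revolution?": "hist-042",
    "What was the Estates-General?": "gov-101",
}
recall, failures = run_suite(toy_search, cases)
# One of the two cases passes, and `failures` names the missing chunk.
```

Wiring this into CI means every chunking or embedding change gets an immediate, objective pass/fail signal, with the failing questions listed for debugging.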
Once our retrieval evals are green and stable, we can move up the pyramid.
Takeaways
Eval infrastructure is where RAG projects succeed or stall. Teams that start with expensive, subjective evaluation methods burn budget on slow feedback loops and ambiguous results. A retrieval-first eval strategy costs a fraction of LLM-as-judge approaches, runs in seconds instead of minutes, and catches the failures most likely to reach production: serving users answers grounded in the wrong information.