RAG Interview Questions: Retrieval, Chunking, Reranking & Evals
Retrieval-augmented generation (RAG) has become a standard interview topic for Forward Deployed Engineers, applied-AI engineers, and anyone wiring LLMs into private data. The questions are less about memorizing definitions and more about judgment: when a RAG system answers confidently but wrongly, can you reason about which stage failed and what you would change?
This page is a question bank covering the RAG pipeline end to end: chunking, embeddings, vector search, reranking, context assembly, hallucination mitigation, and evaluation. Each question includes how to approach it. Read it to sharpen your vocabulary and trade-offs, then rehearse the reasoning out loud, which is what the round actually grades.
Retrieval pipeline questions
The core of any RAG interview is the retrieval pipeline: how documents become chunks, chunks become vectors, and a query pulls back the right context. Interviewers want to hear the trade-offs, not a memorized diagram.
How do you choose a chunking strategy?
There is no universal size. Too-large chunks dilute the signal and waste context; too-small chunks lose the surrounding meaning. Talk through fixed-size chunks with overlap versus semantic or structure-aware splitting (by heading, paragraph, or code block), and tie the choice to the document type and the questions users ask.
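To make the fixed-size-with-overlap option concrete, here is a minimal sketch. It splits on whitespace and counts words, which is an assumption for illustration (real pipelines usually count tokens), and the function name and default sizes are hypothetical.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words; adjacent chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbors, at the cost of storing and embedding some words twice. Structure-aware splitting would replace the word window with boundaries taken from headings or paragraphs.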
What are embeddings and how does vector search work?
An embedding maps text to a vector so that semantically similar text lands nearby. Retrieval embeds the query and finds its nearest neighbors, typically by cosine similarity, usually via an approximate index (such as HNSW) for speed. Note that embedding quality, and how well your embedding model matches your domain, usually matter more than the choice of index.
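A small sketch of the nearest-neighbor step, using exhaustive search over hand-made two-dimensional vectors. The vectors, document names, and function names are made up for illustration; a real system would get vectors from an embedding model and would typically replace the full scan with an approximate index.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2):
    """Score every document against the query and return the k best (id, score) pairs."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, invented for this example.
docs = {"refund": [1.0, 0.0], "shipping": [0.0, 1.0], "returns": [0.8, 0.6]}
nearest = top_k([1.0, 0.1], docs, k=2)
```

Here `nearest` ranks "refund" first and "returns" second, because their vectors point in nearly the same direction as the query.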
When would you use hybrid search over pure vector search?
Vector search captures meaning but misses the exact matches (product codes, error strings, rare names) that keyword (BM25) search nails. Hybrid search runs both and reconciles the two ranked lists (for example with reciprocal rank fusion). This is one of the highest-signal answers you can give.
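Reciprocal rank fusion is simple enough to sketch in a few lines. Each document earns 1 / (k + rank) from every list it appears in, and the totals decide the fused order. The constant k = 60 is a commonly used default rather than a requirement, and the document IDs below are placeholders.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

vector_hits = ["a", "b", "c"]   # placeholder result of vector search
keyword_hits = ["c", "a", "d"]  # placeholder result of BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because the fusion uses only ranks, it sidesteps the problem that cosine scores and BM25 scores live on different scales.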
What does a reranker do and when is it worth it?
A cross-encoder reranker re-scores the top candidates by looking at the query and each document together, which is more accurate than the initial bi-encoder retrieval but too slow to run over the whole corpus. The pattern is to retrieve many candidates cheaply, then rerank a small set precisely. Mention the added latency as the cost.
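The two-stage pattern can be sketched without any model at all. In the example below both scorers are toy stand-ins (word overlap for the cheap stage, Jaccard similarity in place of a cross-encoder), chosen only so the code runs on its own; every name here is hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def word_overlap(query: str, doc: str) -> float:
    """Cheap stand-in for first-stage retrieval: count of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def jaccard(query: str, doc: str) -> float:
    """Stand-in for a slower, more precise reranker (NOT a real cross-encoder)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def retrieve_then_rerank(query: str, candidates: list[str], cheap: Scorer,
                         precise: Scorer, retrieve_n: int = 50, keep_n: int = 5) -> list[str]:
    # Stage 1: score everything cheaply and keep a shortlist.
    shortlist = sorted(candidates, key=lambda d: cheap(query, d), reverse=True)[:retrieve_n]
    # Stage 2: run the expensive scorer only on the shortlist.
    return sorted(shortlist, key=lambda d: precise(query, d), reverse=True)[:keep_n]

doc_a = "reset your password from the login page"
doc_b = "password policy and reset schedule for administrators in the billing and shipping teams"
doc_c = "shipping rates"
ranked = retrieve_then_rerank("reset password", [doc_b, doc_c, doc_a],
                              word_overlap, jaccard, retrieve_n=2, keep_n=2)
```

The cheap stage ties `doc_a` and `doc_b`; the second stage breaks the tie in favor of the more focused `doc_a`. The `retrieve_n` parameter is the latency lever: the precise scorer runs once per shortlisted document.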
How do you keep the index fresh?
Stale or missing documents are a top cause of wrong answers. Discuss incremental re-indexing on document change, handling deletes and updates (not just inserts), and versioning so you can tell which snapshot answered a query.
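One way to plan an incremental re-index is to compare content hashes, sketched below. It assumes the index stores a hash per document ID, which is an assumption about your setup; the function names are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(indexed_hashes: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare what the index holds against the current corpus.

    indexed_hashes: doc_id -> hash stored at last indexing time
    current_docs:   doc_id -> current document text
    """
    current_hashes = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    shared = current_hashes.keys() & indexed_hashes.keys()
    return {
        "insert": sorted(current_hashes.keys() - indexed_hashes.keys()),
        "update": sorted(d for d in shared if current_hashes[d] != indexed_hashes[d]),
        "delete": sorted(indexed_hashes.keys() - current_hashes.keys()),
    }
```

The "delete" list is the part that insert-only pipelines miss: a document removed from the source keeps answering queries until its chunks leave the index.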
Answer quality and hallucination questions
Once context is retrieved, the questions shift to how it reaches the model and how you keep the model honest. The single most important instinct: a confident wrong answer is usually a retrieval failure, not a model failure.
A RAG assistant answers confidently but wrongly. How do you debug it?
Check what was retrieved before blaming the model. Log the retrieved chunks for the failing query. If the right context is absent, the bug is upstream (chunking, embeddings, search); if it is present but ignored, the bug is in prompting or context assembly. Naming this split is exactly what interviewers listen for.
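The absent-versus-ignored split can be written down as a tiny triage helper. It assumes you have logged the retrieved chunk IDs for the failing query and know which chunk IDs actually contain the answer; the function name and return labels are hypothetical.

```python
def triage_wrong_answer(retrieved_ids: list[str], relevant_ids: set[str]) -> str:
    """Classify a wrong answer by whether the right context reached the model."""
    if not relevant_ids.intersection(retrieved_ids):
        # The right chunk never came back: look at chunking, embeddings, or search.
        return "upstream"
    # The right chunk was retrieved: look at prompting or context assembly.
    return "downstream"
```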
How do you reduce hallucination in a RAG system?
Ground the model: instruct it to answer only from the provided context and to say "I do not know" when the context is insufficient, cite sources so answers are checkable, and, when confidence is low, decline rather than guess. Better retrieval reduces hallucination more than prompt tweaks do.
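As an illustration of grounding instructions, here is a sketch of a prompt builder that numbers the sources so the answer can cite them. The exact wording of the instructions is an assumption and would need tuning for a particular model; the function name is hypothetical.

```python
def build_grounded_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Build a prompt from (source_name, text) pairs given in ranked order."""
    sources = "\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below. Cite sources by number, like [1].\n"
        'If the sources do not contain the answer, reply exactly: "I do not know".\n\n'
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
    )
```

Numbered sources make each claim checkable, and a fixed refusal string is easy to count later when you monitor how often the system declines.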
How do you fit retrieved context into a limited context window?
You cannot stuff everything in. Retrieve more than you need, then rerank and trim to a budget, deduplicate near-identical chunks, and order the most relevant context where the model attends best. Mention that adding more context is not always better; irrelevant context can hurt.
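The dedupe-and-trim step might look like the sketch below. It makes two simplifying assumptions: near-duplicates are detected only by normalizing case and whitespace (real systems often compare embeddings or shingles), and the budget is counted in words rather than tokens. It does not reorder chunks; it keeps the ranked order it is given.

```python
def assemble_context(ranked_chunks: list[str], budget_words: int) -> list[str]:
    """Walk chunks in ranked order, dropping duplicates and anything over budget."""
    seen: set[str] = set()
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        key = " ".join(chunk.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # duplicate of a chunk already selected
        cost = len(key.split())
        if used + cost > budget_words:
            continue  # too large for the remaining budget; a later chunk may still fit
        seen.add(key)
        selected.append(chunk)
        used += cost
    return selected
```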
How should the model handle conflicting or outdated sources?
Prefer recency and authority: attach metadata (source, timestamp) to chunks and let the prompt or reranker weigh it, and surface the conflict to the user rather than silently picking one. This is a judgment answer, so reason about the customer, not just the code.
When is RAG the wrong tool?
If the task needs reasoning over the whole corpus (broad summarization), or the knowledge is small and static enough to fit in the prompt, or it needs behavior change rather than new facts (where fine-tuning fits better), RAG may be overkill. Knowing when not to reach for it is a senior signal.
Evaluation and production questions
Strong candidates separate retrieval quality from answer quality so they know which half broke, and they measure changes instead of eyeballing them. Evaluation questions are where many candidates go vague, so be specific.
How do you evaluate a RAG system?
Evaluate the two stages separately. For retrieval, use a labeled set and metrics like recall@k and precision@k (did the right chunk come back?). For generation, score faithfulness (is the answer grounded in the retrieved context?) and answer relevance, often with a fixed question-and-expected-answer set and, increasingly, an LLM-as-judge with human spot-checks.
What is an eval set and how do you build one?
A fixed collection of representative questions with expected answers or expected source chunks. Build it from real user queries and known-hard cases, keep it version-controlled, and run it on every change so you can measure regressions instead of guessing.
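A minimal sketch of running such a set against a retriever. The case format (a question plus one expected chunk ID) and the `retrieve` callable are assumptions for illustration; your retriever's interface will differ.

```python
from typing import Callable

def run_eval(eval_set: list[dict], retrieve: Callable[[str], list[str]], k: int = 3):
    """Return (hit_rate, failing_questions) for a retriever over a fixed eval set.

    Each case is {"question": str, "expected_chunk": str}.
    """
    failures = []
    for case in eval_set:
        retrieved = retrieve(case["question"])[:k]
        if case["expected_chunk"] not in retrieved:
            failures.append(case["question"])
    hit_rate = 1 - len(failures) / len(eval_set) if eval_set else 0.0
    return hit_rate, failures
```

Returning the failing questions, not just the rate, is what makes a regression actionable: you can diff the failure list between two versions of the pipeline.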
How do you measure retrieval quality specifically?
Label which chunks are relevant for each test query, then compute recall@k (the fraction of relevant chunks that appear in the top k) and precision@k (the fraction of the top k that are relevant), plus mean reciprocal rank if position matters. This isolates retrieval from generation so a fix is targeted.
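These metrics are short enough to write from scratch, as in the sketch below. The chunk IDs are placeholders, and mean reciprocal rank is simply the average of `reciprocal_rank` over all test queries.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["c7", "c2", "c9", "c4"]  # placeholder ranked results for one query
relevant = {"c2", "c4", "c5"}         # placeholder labels for that query
```

With these placeholders and k = 3, one of the three relevant chunks is in the top three, so recall@3 and precision@3 are both 1/3, and the reciprocal rank is 1/2 because the first relevant chunk sits at position two.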
How do you handle latency and cost in production RAG?
Embedding and reranking add hops. Cache embeddings and frequent-query results, cap how many chunks you rerank, and consider a smaller model for reranking. Always ask which metric hurts (latency, cost, or quality) before naming a lever.
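Caching embeddings can be as simple as memoizing the embedding call, sketched here with the standard library. `slow_embed` is a hypothetical stand-in for a real embedding request, and an in-process cache is an assumption; a multi-server deployment would more likely use a shared cache.

```python
from functools import lru_cache

def slow_embed(text: str) -> tuple[float, ...]:
    """Hypothetical stand-in for a slow or paid embedding call."""
    return tuple(float(ord(ch)) for ch in text[:4])

@lru_cache(maxsize=10_000)
def cached_embed(text: str) -> tuple[float, ...]:
    """Return the embedding, reusing the stored result for repeated text."""
    return slow_embed(text)
```

The embedding is returned as a tuple rather than a list so a cached value cannot be mutated by one caller and silently corrupted for the next. `cached_embed.cache_info()` reports hits and misses, which is a quick way to check whether the cache is earning its memory.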
How do you monitor a RAG system after launch?
Log queries, retrieved chunks, and answers; track "I do not know" and low-confidence rates; and sample real traffic for periodic human review. Treat retrieval misses and hallucinations as separate alerts because they have different fixes.
Frequently asked questions
What are the most common RAG interview questions?
Chunking strategy and trade-offs, how embeddings and vector search work, when to use hybrid (vector plus keyword) search, what a reranker does, how to debug a confident wrong answer, how to reduce hallucination, and how to evaluate the system, separating retrieval quality from answer quality.
How do you answer a RAG debugging question?
Check what was retrieved before blaming the model. A confident wrong answer usually means the right context never reached the model, which is an upstream retrieval failure. Log the retrieved chunks, then decide whether the right context is absent (fix chunking, embeddings, or search) or present but ignored (fix prompting and context assembly).
What is the difference between RAG, fine-tuning, and prompting?
Prompting steers a model with instructions and examples in the context. RAG injects fresh external knowledge at query time via retrieval, ideal for changing or private facts. Fine-tuning adjusts the model weights to change behavior, tone, or format, but it is a poor way to teach new facts. RAG for knowledge, fine-tuning for behavior, prompting for steering.
How do you evaluate a RAG system in an interview answer?
Evaluate retrieval and generation separately. For retrieval, use a labeled set with recall@k and precision@k. For generation, score faithfulness (grounded in the retrieved context) and relevance against a fixed question-and-expected-answer set, often with an LLM-as-judge plus human spot-checks, and run it on every change to catch regressions.