LLM Interview Questions: Tokens, Sampling, Fine-Tuning & Agents

Updated July 2026 · Rung

Large language model (LLM) questions now show up across Forward Deployed Engineer, applied-AI, and platform-engineering interviews. You do not need to be a researcher, but you do need to speak the trade-offs fluently: how tokens and context limits constrain you, when to reach for prompting versus RAG versus fine-tuning, how sampling settings change behavior, and how to serve models fast and cheap without cutting quality.

This is a question bank covering the LLM topics interviewers actually probe, grouped by theme, each with a short note on how to approach it. Read it to calibrate your vocabulary and judgment, then rehearse the reasoning out loud, because the round grades how you weigh trade-offs, not whether you can recite a definition.

Fundamentals: tokens, context, and sampling

These questions check that you understand what the model actually operates on and how its outputs are shaped. Getting the fundamentals right is table stakes for everything downstream.

What is tokenization and why does it matter?

Models process text as tokens (subword units), not characters or words; in English, a token averages roughly three to four characters. It matters because cost and context limits are counted in tokens, latency grows with token count, and subword tokenization explains quirks like weak character-level counting. Tie it back to budgeting a prompt.
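To make the budgeting point concrete, here is a minimal sketch that estimates token counts from character length. The four-characters-per-token constant and the function names are assumptions for illustration; a real tokenizer will give different counts, so treat this as a rough planning tool only.

```python
import math

# Assumed rough average for English text; real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (a heuristic, not a tokenizer)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_budget(prompt: str, max_output_tokens: int, context_limit: int) -> bool:
    """Check that the estimated prompt plus the reserved output fits the window."""
    return estimate_tokens(prompt) + max_output_tokens <= context_limit
```

In an interview, the useful part is the shape of the check: input and reserved output share one limit, so a longer prompt leaves less room for the answer.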

What is the context window and how do you work within it?

The context window is the maximum tokens of input plus output the model can attend to at once. When content exceeds it, you must summarize, chunk, or retrieve rather than dump everything in, and even within the limit, more context is not always better because irrelevant text can dilute attention.
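The chunking option can be sketched as a sliding window over the text. This example counts words as a stand-in for tokens, which is an assumption made to keep it dependency-free; the function name and the overlap parameter are illustrative, not taken from any particular library.

```python
def chunk_words(text: str, max_words: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

A small overlap keeps a sentence that straddles a boundary visible in both chunks, at the cost of some duplicated tokens.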

What do temperature, top-p, and top-k control?

They shape sampling. Temperature rescales the token probabilities before sampling (low is focused and near-deterministic, high is more varied and riskier). Top-k keeps only the k most probable tokens; top-p (nucleus) keeps the smallest set of tokens whose cumulative probability reaches p. For extraction or classification, keep temperature low; for brainstorming, raise it. Name the task before naming the setting.
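A small pure-Python sketch can show how the three settings interact. It applies temperature to raw logits, then top-k, then top-p, and renormalizes what is left. The order of operations and the function names are assumptions for illustration; production inference libraries may apply these filters differently.

```python
import math
import random


def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    # Rank tokens from most to least probable.
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for p, i in ranked:
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    norm = sum(p for p, _ in ranked)
    return {i: p / norm for p, i in ranked}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = sampling_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

Lowering the temperature concentrates probability on the top token, while top-k and top-p cut off the tail entirely, which is why they are often combined.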

How do you get structured or reliable output from an LLM?

Constrain and validate. Ask for a specific format (JSON with a named schema), use structured-output or function-calling modes where available, keep temperature low, and validate the result, retrying or repairing on a parse failure rather than trusting the first response.
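The validate-and-retry loop can be sketched as follows. Here `call_model` is a hypothetical callable that takes a prompt string and returns the model's raw text; the key check stands in for full schema validation, and the retry message wording is an assumption.

```python
import json


def parse_with_retry(call_model, prompt, required_keys, max_attempts=3):
    """Call the model until it returns a JSON object containing required_keys."""
    current_prompt = prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc.msg}"
        else:
            if not isinstance(data, dict):
                last_error = "expected a JSON object"
            else:
                missing = sorted(set(required_keys) - set(data))
                if not missing:
                    return data
                last_error = f"missing keys: {missing}"
        # Feed the failure back so the next attempt can repair it.
        current_prompt = (
            f"{prompt}\n\nYour previous reply was rejected ({last_error}). "
            "Reply with a single JSON object only."
        )
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Capping attempts matters as much as retrying: an unbounded repair loop turns a formatting bug into a cost and latency problem.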

Why do LLMs hallucinate, and what reduces it?

They generate the most likely next token, not verified truth, so with thin or missing grounding they produce fluent but wrong text. Reduce it by grounding answers in retrieved context (RAG), instructing the model to decline when unsure, asking for citations, and lowering temperature for factual tasks.

Adapting models: prompting, RAG, fine-tuning, and agents

The highest-signal LLM questions are about choosing the right adaptation technique for a problem. Interviewers want to hear you match the tool to the need rather than defaulting to the most powerful one.

When do you use prompting vs RAG vs fine-tuning?

Prompting (instructions and few-shot examples) steers behavior cheaply. RAG injects fresh or private knowledge at query time. Fine-tuning changes the weights to shift behavior, tone, or format. The rule: RAG for knowledge, fine-tuning for behavior, prompting for steering. Reaching for fine-tuning to teach new facts is a classic wrong answer.

What is few-shot prompting and when does it help?

Including a few input-output examples in the prompt to demonstrate the task, format, or edge cases. It helps most on format-sensitive or ambiguous tasks and when you cannot fine-tune. The trade-off is that examples consume context and add latency and cost.
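As an illustration, a few-shot prompt can be assembled from an instruction, a list of example pairs, and the new query. The `Input:` and `Output:` labels and the function name are assumptions; any consistent format works as long as the examples and the final query share it.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # End on an open "Output:" so the model completes the pattern.
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)
```

Every example added here is paid for on every request, which is the context and cost trade-off described above.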

How does an LLM agent use tools, and what are the guardrails?

An agent loops: the model decides to call a tool (search, code, an API), reads the result, and continues until it is done. The guardrails matter most in an interview: a max-step budget, explicit termination conditions, idempotent tools so a retried action is safe, and validation of tool inputs and outputs.
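The loop and its guardrails can be sketched in a few lines. In this example `decide` is a hypothetical stand-in for the model call, and the action dictionary shape (`type`, `tool`, `args`, `answer`) is an assumption made for illustration, not any framework's format.

```python
def run_agent(decide, tools, task, max_steps=5):
    """Run a tool-use loop with a step budget and basic input validation."""
    history = [("task", task)]
    for _ in range(max_steps):
        action = decide(history)
        # Explicit termination condition: the model says it is done.
        if action.get("type") == "finish":
            return action["answer"]
        name = action.get("tool")
        args = action.get("args")
        # Validate tool inputs before executing anything.
        if name not in tools or not isinstance(args, dict):
            history.append(("error", f"rejected tool call: {name!r}"))
            continue
        try:
            result = tools[name](**args)
        except Exception as exc:  # surface tool failures to the model
            result = f"tool failed: {exc}"
        history.append((name, result))
    raise RuntimeError("step budget exhausted before the agent finished")
```

The step budget guarantees the loop ends even if the model never emits a finish action, and rejected calls are recorded so the model can correct itself on the next step.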

What is the difference between fine-tuning and RLHF?

Supervised fine-tuning trains on input-output pairs to teach a task or style. RLHF (reinforcement learning from human feedback) further tunes the model against a reward model built from human preferences, which is a common way base models are aligned to be helpful and harmless. Most product teams do supervised fine-tuning, not RLHF.

When is a smaller model the right call?

Often. A smaller or distilled model can hit latency and cost targets a frontier model cannot, and for narrow, well-scoped tasks it can match quality after prompting or light fine-tuning. Framing the choice around the specific quality bar and budget is a senior signal.

Serving, evaluation, and safety

These questions probe whether you can put an LLM into production responsibly: fast enough, measured, and safe. Vague answers here are common, so be specific with the numbers and the failure modes.

What inference metrics should you know?

Latency versus throughput, time-to-first-token, tokens per second, and p50 versus p99 tail latency. Streaming improves perceived latency by showing tokens as they are generated. When asked to optimize, first ask which metric hurts (latency, throughput, or cost), then name a lever.
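To show why p50 and p99 tell different stories, here is a minimal sketch using the nearest-rank percentile method. The function names and the sample latencies are made up for illustration; monitoring tools may use interpolated percentiles instead.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Report the median and the tail for a list of request latencies."""
    return {"p50": percentile(latencies_ms, 50), "p99": percentile(latencies_ms, 99)}
```

With ten requests where nine take about 100 ms and one takes 2000 ms, the p50 looks healthy while the p99 exposes the slow request, which is the case for reporting both.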

How do you make LLM serving faster or cheaper?

Levers include batching and continuous batching for throughput, KV-cache reuse, quantization (FP16 to INT8 to INT4) to shrink memory and speed up inference, prompt and response caching, and routing easy queries to a smaller model. Each trades something, so state the trade.
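Two of these levers, response caching and routing, can be sketched together. The class name, the `is_easy` predicate, and the two model callables are all hypothetical; in practice the routing decision is often a classifier or a heuristic tuned on real traffic.

```python
class CachedRouter:
    """Serve repeated prompts from a cache and send easy ones to a cheaper model."""

    def __init__(self, small_model, large_model, is_easy):
        self._small = small_model
        self._large = large_model
        self._is_easy = is_easy
        self._cache = {}

    def answer(self, prompt):
        key = prompt.strip()
        # Exact-match cache: only safe when identical prompts should get identical replies.
        if key in self._cache:
            return self._cache[key]
        model = self._small if self._is_easy(key) else self._large
        reply = model(key)
        self._cache[key] = reply
        return reply
```

The trades are worth stating out loud: caching can serve stale or context-insensitive answers, and a router that misjudges difficulty sends hard queries to a model that cannot handle them.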

How do you evaluate an LLM application?

Build a fixed set of representative inputs with expected outputs or graded rubrics, and run it on every change to catch regressions. Use exact or programmatic checks where you can, and an LLM-as-judge with human spot-checks for open-ended output. Measure, do not eyeball.
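A regression eval can start as small as the sketch below. Each case pairs an input with a programmatic check; `app` is a hypothetical callable wrapping the LLM application, and the case format is an assumption for illustration.

```python
def run_eval(app, cases):
    """Run every case through the app and report the pass rate and failures."""
    if not cases:
        raise ValueError("eval set must not be empty")
    failures = []
    for case in cases:
        output = app(case["input"])
        # Each case carries its own check, e.g. exact match or a substring test.
        if not case["check"](output):
            failures.append({"input": case["input"], "output": output})
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures}
```

Returning the failing inputs alongside the pass rate makes a regression actionable, since the number alone does not say what broke.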

How do you handle prompt injection and unsafe output?

Treat model input and output as untrusted: never let retrieved or user text silently override system instructions, sandbox any tools the model can call and grant them least privilege, filter or moderate outputs, and keep a human in the loop for high-stakes actions. This is increasingly a required answer, not a bonus.

How do you control cost in an LLM product?

Cost is tokens times price, with input and output tokens typically priced separately. Trim prompts and retrieved context, cache repeated calls, route easy queries to a cheaper model, cap output length, and monitor per-request token usage. Tie every lever to a measured change in spend, not a hunch.
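The arithmetic is simple enough to sketch directly. The per-million-token prices are parameters here because real prices vary by provider and model; the function names and any numbers passed in are placeholders, not actual pricing.

```python
def request_cost(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Cost of one request, with prices quoted per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


def monthly_cost(requests_per_day, avg_input_tokens, avg_output_tokens,
                 input_price_per_mtok, output_price_per_mtok, days=30):
    """Project spend from average per-request token usage."""
    per_request = request_cost(
        avg_input_tokens, avg_output_tokens, input_price_per_mtok, output_price_per_mtok
    )
    return per_request * requests_per_day * days
```

Plugging measured token averages into a projection like this shows which lever matters: trimming a long prompt helps the input term, while capping output length helps the output term.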

Practice applied-AI drills in the browser, free

Frequently asked questions

What are common LLM interview questions?

Tokenization and context windows, what temperature and top-p control, why models hallucinate and how to reduce it, when to use prompting vs RAG vs fine-tuning, how agents use tools and their guardrails, inference metrics and serving optimizations, how to evaluate an LLM app, and safety topics like prompt injection.

How do I explain the difference between RAG and fine-tuning?

RAG injects fresh or private knowledge at query time through retrieval, so it is the right tool when facts change or are proprietary. Fine-tuning adjusts the model weights to change behavior, tone, or format, and is a poor way to teach new facts. The rule of thumb: RAG for knowledge, fine-tuning for behavior, prompting for steering.

What inference metrics should I know for an LLM interview?

Latency versus throughput, time-to-first-token, tokens per second, and p50 versus p99 tail latency, plus the levers that move them: batching and continuous batching, KV-cache reuse, quantization (FP16/INT8/INT4), caching, and model routing. Always tie an optimization to the specific metric that hurts.

How do you evaluate an LLM in production?

Use a fixed eval set of representative inputs with expected outputs or graded rubrics, run it on every change to catch regressions, apply programmatic checks where possible, and use an LLM-as-judge with human spot-checks for open-ended output. Monitor live traffic for hallucination and safety issues, and measure rather than eyeball.