Performance Testing of LLM Inference Endpoints
Large language model endpoints are no longer experimental side services. They are production APIs with user-facing latency expectations, cost constraints, and availability targets that must hold under real traffic patterns. For QA teams, testing these endpoints requires more than sending a few prompts and checking whether a completion is returned.
LLM inference is a different performance problem. A response can begin quickly but finish slowly, or appear fast for short prompts and fall apart when token counts grow. Queueing, batching, GPU saturation, context length, and streaming behavior all change the shape of the results. That is why performance testing for model serving needs a workload model, measurable thresholds, and a repeatable harness.
Why LLM endpoints need specialized performance testing
Traditional API tests usually focus on response time, error rate, and throughput. Those metrics still matter for LLM inference, but they do not tell the whole story. A model endpoint can pass ordinary latency checks and still fail in production because output tokens slow down after the first token, requests pile up in a batcher, or long prompts trigger memory pressure.
In practice, QA teams should evaluate both the infrastructure and the user experience. Infrastructure metrics tell you whether the serving stack is healthy. User-facing metrics tell you whether the application feels responsive enough for chat, summarization, extraction, or agent workflows.
What makes LLM serving different
- Prompt length varies widely. A 20-token request and a 4,000-token request exercise very different parts of the system.
- Output is streamed token by token. First-token latency and sustained token throughput often matter more than a single total response time.
- Dynamic batching changes timing. The server may delay an individual request to combine it with others for better GPU utilization.
- Context windows consume memory. Large prompts can reduce concurrency long before CPU reaches its limit.
- Sampling parameters affect work per request. Temperature, top-p, max tokens, and tool-calling behavior can significantly change runtime.
Core metrics to measure
Before running load or stress tests, define the metrics that matter to your product. The right metrics depend on the endpoint contract, but most QA teams should collect a common baseline.
Latency metrics
- Time to first token (TTFT): The time from request submission until the first streamed token arrives. This is critical for chat experiences.
- End-to-end latency: The total time until the response completes. This is still useful for non-streaming workloads and batch jobs.
- P50, P95, and P99 latency: Percentiles show tail behavior, which is often where model-serving systems fail under load.
Throughput metrics
- Requests per second: Good for coarse capacity checks, but not sufficient by itself.
- Tokens per second: Often the most meaningful measure for inference efficiency because generation cost scales with token output.
- Concurrent active requests: Helps reveal queue buildup, GPU contention, and scheduler behavior.
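To make the latency percentiles and token-throughput numbers above concrete, here is a minimal aggregation sketch. The per-request result shape ({ ttftMs, totalLatencyMs, tokenCount }) and the function names are assumptions that match the example harness later in this article, not a standard API.

// Minimal aggregation sketch: compute latency percentiles and token throughput
// from an array of per-request results shaped like { ttftMs, totalLatencyMs, tokenCount }.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return null;
  // Nearest-rank percentile on an ascending-sorted array.
  const index = Math.min(sortedValues.length - 1, Math.ceil((p / 100) * sortedValues.length) - 1);
  return sortedValues[Math.max(0, index)];
}

function summarize(results, wallClockSeconds) {
  const latencies = results.map(r => r.totalLatencyMs).sort((a, b) => a - b);
  const ttfts = results.map(r => r.ttftMs).filter(v => v !== null).sort((a, b) => a - b);
  const totalTokens = results.reduce((sum, r) => sum + r.tokenCount, 0);
  return {
    p50LatencyMs: percentile(latencies, 50),
    p95LatencyMs: percentile(latencies, 95),
    p99LatencyMs: percentile(latencies, 99),
    p95TtftMs: percentile(ttfts, 95),
    requestsPerSecond: results.length / wallClockSeconds,
    tokensPerSecond: totalTokens / wallClockSeconds
  };
}

Computing throughput against wall-clock time for the whole run, rather than summing per-request rates, keeps the numbers honest when requests overlap.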
Quality-of-service metrics
- Error rate: Timeouts, 429s, 5xx responses, and malformed streaming events.
- Timeout distribution: Distinguish between client-side timeouts and server-side failures.
- Stability over time: Track whether performance degrades after sustained load, restarts, or autoscaling events.
Designing realistic LLM workloads
A useful benchmark reflects how the endpoint is actually used. Synthetic traffic that sends identical prompts with identical output lengths can hide bottlenecks or create misleading optimism. Instead, build a workload mix that matches real usage patterns.
Model the request mix
Start by categorizing the request types your endpoint supports. For example, a support chatbot may see short greetings, medium-length troubleshooting prompts, and long context-rich conversations. A document extraction service may receive a small number of very large prompts with predictable output shapes.
- Short prompts: Used to test baseline responsiveness and overhead.
- Medium prompts: Useful for measuring realistic chat or summarization traffic.
- Long-context prompts: Reveal memory pressure and queueing behavior.
- High-output prompts: Expose token-generation bottlenecks and streaming performance.
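One way to encode a mix like the categories above is a weighted table that the load generator samples from. The weights, token counts, and category names below are illustrative placeholders; replace them with proportions observed in your production traffic.

// Hypothetical workload mix: categories and weights are illustrative.
const workloadMix = [
  { name: 'short-prompt',  weight: 0.40, promptTokens: 20,   maxTokens: 64 },
  { name: 'medium-prompt', weight: 0.40, promptTokens: 400,  maxTokens: 256 },
  { name: 'long-context',  weight: 0.15, promptTokens: 4000, maxTokens: 256 },
  { name: 'high-output',   weight: 0.05, promptTokens: 200,  maxTokens: 1024 }
];

// Pick a category for each simulated request according to its weight.
function pickCategory(mix) {
  const roll = Math.random();
  let cumulative = 0;
  for (const category of mix) {
    cumulative += category.weight;
    if (roll <= cumulative) return category;
  }
  return mix[mix.length - 1];
}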
Vary generation settings intentionally
Performance can shift dramatically when output length or sampling configuration changes. Keep your benchmark matrix explicit so results remain interpretable.
- Max output tokens: Test small, medium, and worst-case outputs.
- Streaming on and off: Compare client-perceived latency with full-response delivery.
- Temperature and top-p: Confirm whether different sampling settings affect compute cost or batching behavior.
- Tool invocation or function calling: Include these paths if the production application relies on them.
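A small sketch of such a matrix is below. The specific values are placeholders; the point is to enumerate the combinations up front so every result can be attributed to a known configuration.

// Sketch of an explicit benchmark matrix. Values are placeholders.
const maxTokenLevels = [64, 256, 1024];
const streamingModes = [true, false];
const samplingConfigs = [
  { temperature: 0.0, topP: 1.0 },
  { temperature: 0.7, topP: 0.9 }
];

const benchmarkMatrix = [];
for (const maxTokens of maxTokenLevels) {
  for (const stream of streamingModes) {
    for (const sampling of samplingConfigs) {
      benchmarkMatrix.push({ maxTokens, stream, ...sampling });
    }
  }
}
// 3 x 2 x 2 = 12 configurations, each run against the same prompt set.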
A practical test strategy
A mature performance plan usually includes four layers: baseline benchmarking, load testing, stress testing, and soak testing. Each layer answers a different question about the endpoint.
1. Establish a baseline
Run a small number of requests in a controlled environment. Measure TTFT, total latency, token throughput, and error rate for representative prompts. The goal is to understand normal behavior before concurrency, batching, and backpressure complicate the picture.
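A baseline run can be as simple as a sequential loop with no concurrency at all, so batching and queueing effects stay out of the picture. This sketch assumes a per-request helper like the testInferenceEndpoint function shown later in this article and a handful of representative prompts.

// Sequential baseline: one request at a time, no concurrency.
// Assumes testInferenceEndpoint (defined later in this article).
async function runBaseline(url, prompts) {
  const results = [];
  for (const prompt of prompts) {
    const result = await testInferenceEndpoint({ url, prompt, maxTokens: 256 });
    results.push(result);
  }
  return results; // feed these into an aggregation helper to get the baseline profile
}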
2. Run load tests
Increase concurrency gradually until the system reaches the expected production load. Observe where latency begins to climb and whether throughput plateaus. For LLM services, the most useful question is often not “How many requests can it take?” but “At what point does token generation become unacceptably slow?”
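A simple way to structure this is a concurrency ramp: step through increasing levels of parallelism and record a summary at each step. The sketch below assumes the testInferenceEndpoint harness shown later in this article and the summarize helper sketched earlier; the concurrency levels and request counts are examples.

// Concurrency ramp sketch: run N parallel workers per level and summarize each level.
async function rampConcurrency(url, prompt, levels = [1, 4, 8, 16, 32], requestsPerWorker = 10) {
  const summaries = [];
  for (const concurrency of levels) {
    const start = Date.now();
    const workers = Array.from({ length: concurrency }, async () => {
      const results = [];
      for (let i = 0; i < requestsPerWorker; i++) {
        results.push(await testInferenceEndpoint({ url, prompt, maxTokens: 256 }));
      }
      return results;
    });
    const results = (await Promise.all(workers)).flat();
    const wallClockSeconds = (Date.now() - start) / 1000;
    summaries.push({ concurrency, ...summarize(results, wallClockSeconds) });
  }
  return summaries; // look for the level where latency climbs or tokens/s plateaus
}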
3. Push into stress conditions
Continue increasing load past the expected peak. Stress tests uncover queue overflows, OOM failures, connection resets, and autoscaling delays. They also show whether the system degrades gracefully or collapses abruptly.
4. Soak the endpoint
Hold the service under a sustained, realistic load for hours. Soak testing helps reveal memory leaks, GPU fragmentation, cache churn, and temperature-related throttling. Many LLM deployments look healthy in short bursts and then degrade over time.
Instrumentation: what to capture
Without good telemetry, LLM performance testing turns into guesswork. Your harness should record both client-side and server-side signals so you can connect symptoms to root causes.
Client-side measurements
- Request start and end timestamps
- First-token arrival time
- Streaming chunk intervals
- Response status codes and retry counts
- Prompt size and output token counts
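Concretely, a per-request record emitted by the client harness might look like the following. The field names and values are illustrative, not a standard schema.

// Illustrative per-request record a client-side harness might emit.
const sampleRecord = {
  requestId: 'load-test-000142',
  startedAt: '2024-05-01T12:00:00.000Z',
  firstTokenAt: '2024-05-01T12:00:00.450Z',
  completedAt: '2024-05-01T12:00:03.200Z',
  chunkIntervalsMs: [450, 38, 41, 37, 52],
  statusCode: 200,
  retries: 0,
  promptTokens: 412,
  outputTokens: 187
};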
Server-side measurements
- GPU utilization and memory usage
- CPU usage, queue depth, and worker saturation
- Batch size distribution
- Model load time and warm-cache behavior
- Timeouts, 429s, and internal exceptions
Business-level signals
- Estimated cost per 1,000 requests
- Estimated cost per 1 million output tokens
- User abandonment risk when TTFT exceeds the acceptable threshold
- SLO compliance for chatbot or agent workflows
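Cost estimates can be derived directly from the token counts the harness already collects. The per-token prices below are placeholders; substitute your provider's actual rates or your own serving cost model.

// Rough cost estimation from aggregate token counts. Prices are placeholders.
function estimateCost({ requests, totalInputTokens, totalOutputTokens }) {
  const inputPricePerMillion = 0.50;   // USD per 1M input tokens (placeholder)
  const outputPricePerMillion = 1.50;  // USD per 1M output tokens (placeholder)
  const inputCost = (totalInputTokens / 1_000_000) * inputPricePerMillion;
  const outputCost = (totalOutputTokens / 1_000_000) * outputPricePerMillion;
  const total = inputCost + outputCost;
  return {
    totalUsd: total,
    usdPer1kRequests: (total / requests) * 1000,
    usdPer1mOutputTokens: (total / totalOutputTokens) * 1_000_000
  };
}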
Example: a simple latency and throughput harness
The following JavaScript example shows the shape of a lightweight test client that measures TTFT, total latency, and token counts for a streaming endpoint. It relies on the fetch API built into Node.js 18 and later, whose response body is a web ReadableStream. In a real test suite, you would add retries, correlation IDs, structured logging, and workload parameterization.
// Uses the fetch built into Node.js 18+, whose response.body is a web ReadableStream.
// (node-fetch exposes a Node stream instead, so reader.read() would not work there.)
async function testInferenceEndpoint({ url, prompt, maxTokens }) {
  const start = Date.now();
  let firstTokenAt = null;
  let tokenCount = 0;

  const response = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, max_tokens: maxTokens, stream: true })
  });
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // Approximate TTFT as the arrival of the first streamed bytes.
    if (firstTokenAt === null) firstTokenAt = Date.now();
    buffer += decoder.decode(value, { stream: true });

    // Treat each newline-delimited chunk as one streamed event. This is a proxy
    // for token count; a real harness should parse the event payload and use
    // the server-reported usage fields or a tokenizer.
    const chunks = buffer.split('\n');
    buffer = chunks.pop();
    for (const chunk of chunks) {
      if (chunk.trim()) tokenCount += 1;
    }
  }
  if (buffer.trim()) tokenCount += 1; // count any trailing partial event

  const end = Date.now();
  return {
    ttftMs: firstTokenAt ? firstTokenAt - start : null,
    totalLatencyMs: end - start,
    tokenCount
  };
}
This style of harness is useful because it mirrors what users experience. If the endpoint streams tokens, your test should measure streaming behavior directly instead of waiting for a final JSON blob and calling that “latency.”
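A single invocation might look like the following; the endpoint URL and prompt are placeholders.

// Example invocation (top-level await requires an ES module context).
const result = await testInferenceEndpoint({
  url: 'https://inference.example.com/v1/completions',
  prompt: 'Summarize the warranty policy in two sentences.',
  maxTokens: 128
});
console.log(`TTFT: ${result.ttftMs} ms, total: ${result.totalLatencyMs} ms, chunks: ${result.tokenCount}`);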
Common bottlenecks in model serving
When LLM endpoints slow down, the cause is not always the model itself. Often the bottleneck is in the serving layer, request scheduling, or infrastructure around the model.
- Cold starts: Model weights need to load or initialize after deployment, scaling, or restarts.
- Batching delays: Larger batches improve throughput but may hurt TTFT.
- GPU memory exhaustion: Long contexts and concurrent requests can force eviction or failure.
- Network overhead: Streaming responses can be delayed by proxy settings, buffering, or TLS termination.
- Tokenizer overhead: For some workloads, tokenization and prompt preprocessing become visible at scale.
How to interpret results
Do not judge an inference endpoint by a single average response time. Averages hide the tail, and the tail is where production incidents usually happen. Instead, read performance results as a system profile.
- Flat throughput with rising latency: The system is saturating.
- TTFT rising before total latency: Queueing or batching is delaying request admission.
- Token throughput dropping under load: The GPU or serving stack is near capacity.
- Errors appearing only during spikes: Autoscaling or connection handling is too slow.
Always compare test runs against a defined service target. For example, “P95 TTFT under 800 ms for short chat prompts” is more actionable than “system seems fast.”
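Encoding those targets as explicit thresholds makes regressions easy to flag in CI. The sketch below assumes a summary object shaped like the aggregation helper earlier in this article; the threshold values are examples, not recommendations.

// Sketch of a threshold check against defined service targets. Values are examples.
const targets = {
  p95TtftMs: 800,        // short chat prompts
  p95LatencyMs: 6000,    // full completion
  minTokensPerSecond: 30 // sustained generation throughput
};

function checkAgainstTargets(summary, t = targets) {
  const failures = [];
  if (summary.p95TtftMs > t.p95TtftMs) failures.push('P95 TTFT above target');
  if (summary.p95LatencyMs > t.p95LatencyMs) failures.push('P95 latency above target');
  if (summary.tokensPerSecond < t.minTokensPerSecond) failures.push('token throughput below target');
  return { passed: failures.length === 0, failures };
}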
QA checklist for LLM inference endpoints
- Confirm the endpoint supports the expected prompt and output ranges.
- Measure TTFT, end-to-end latency, token throughput, and error rate.
- Include short, medium, and long prompts in the workload mix.
- Test both streaming and non-streaming response modes.
- Run load, stress, and soak tests before release.
- Correlate client metrics with GPU, CPU, and queue telemetry.
- Validate autoscaling behavior and recovery after saturation.
- Document service targets and regression thresholds for future releases.
Final thoughts
Performance testing LLM inference endpoints is really about understanding the relationship between user experience, compute cost, and serving architecture. The best QA teams treat latency and token throughput as first-class metrics, build representative workloads, and watch for failure modes that only appear at scale.
If your application depends on model serving, make performance testing part of every release cycle. The endpoint may still return a correct answer under pressure, but that does not mean it is ready for production if the answer arrives too late to matter.