Skip to content

Benchmarking

Evaluation axes

Benchmarking covers two orthogonal axes:

Axis Metrics Scope
Performance tokens/sec, latency, wall-clock time Both Extract and Select modes
Accuracy precision, recall, F1 vs human-curated gold standard Select mode only

Why accuracy evaluation is Select-mode only

The mapping TSV extraction_answer column is the output of a previous tool (MetaSRA), not a human-curated ground truth. Therefore it cannot serve as a reliable gold standard for Extract evaluation.

The mapping_answer_id column, on the other hand, is human-curated and compares the pipeline's final ontology term selection against a known-correct answer. Accuracy evaluation is performed exclusively via Select mode using this column.

Why tokens/sec, not GPU utilization

nvidia-smi reports SM (Streaming Multiprocessor) utilization, which measures compute occupancy. LLM inference, however, is memory-bandwidth-bound: the bottleneck is moving weights from VRAM to the compute units, not the compute itself. A GPU can show 5% SM utilization while being completely saturated on memory bandwidth.

tokens/sec (eval_count / eval_duration) directly measures how fast the model is generating output tokens. This is the correct metric for:

  • Comparing pipeline configurations (parallelism, batch size)
  • Detecting GPU saturation
  • Estimating wall-clock time for a given workload

LLM timing fields

Each ExtractEntry persists an LlmTimingFields object with the subset of Ollama timing data needed for benchmarking. ChatResponse objects are kept in memory during the run for aggregate statistics but are not saved to disk.

Every ChatResponse from Ollama contains nanosecond-precision timing data:

total_duration = load_duration + prompt_eval_duration + eval_duration + internal overhead
Field Unit Description
total_duration ns Wall-clock time inside Ollama for this request
load_duration ns Model load/unload time (high on cold start)
prompt_eval_count tokens Number of prompt tokens evaluated
prompt_eval_duration ns Time spent on prompt evaluation
eval_count tokens Number of generated tokens
eval_duration ns Time spent on token generation

Diagnosing execution time variance

Execution time varies between runs even with identical inputs. The performance field in the result JSON provides data to isolate the cause.

Layer 1: LLM internal

  • load_duration spikes: Model was unloaded from GPU between requests. Indicates Ollama evicted the model (memory pressure, timeout). Check max_load_duration_sec and mean_load_duration_sec.
  • eval_duration / eval_count variance: Per-token generation speed fluctuates. May indicate KV cache pressure or Ollama queuing.
  • total_duration vs wall-clock gap: If sum(total_duration) is much less than wall-clock time, requests are spending time in the Ollama queue.

Layer 2: GPU / hardware

  • Thermal throttling reduces clock speed under sustained load.
  • Memory bandwidth contention from other processes sharing the GPU.
  • Compare tokens_per_sec p50 vs p99: a large gap suggests intermittent hardware interference.

Layer 3: Pipeline structure

  • asyncio.gather takes the maximum of all coroutine durations, so one slow request dominates batch time.
  • Resume file writes (resume_write_sec) grow linearly with accumulated outputs.

Layer 4: OS / environment

  • On shared clusters (e.g., NIG Slurm), other jobs compete for GPU, network, and storage.
  • Compare runs at different times to isolate environmental noise.

Detecting GPU saturation

GPU saturation means adding more parallelism no longer increases total throughput.

Method

  1. Run the same workload with different concurrency levels (e.g., 1, 4, 16, 64, 256).
  2. From the result's performance field, compute:
  3. T_N: mean per-request tokens/sec at concurrency N
  4. N * T_N: estimated total throughput

Interpretation

Observation Meaning
N * T_N increases linearly GPU is underutilized; increase concurrency
N * T_N plateaus GPU is saturated; optimal concurrency reached
N * T_N decreases Contention overhead; reduce concurrency
T_N drops sharply at some N Queue pressure; consider the previous N as optimal

Ensuring reproducibility

Warm-up

Cold starts inflate load_duration. Before benchmarking, send a few dummy requests to ensure the model is loaded and the KV cache is initialized.

Multiple runs

Report median +- IQR (interquartile range) over at least 3 runs. Mean is sensitive to outliers; median is not.

Normalize by tokens/sec

Wall-clock time depends on the number of tokens generated, which varies with input. Use tokens_per_sec to compare across different inputs.

Reading PerformanceSummary

Performance data is embedded in the result file's performance field (both ExtractResult and SelectResult).

Top-level fields

Field What to check
performance.total_wall_sec End-to-end wall-clock time
performance.total_input_entries / performance.completed_count Did all entries complete?
evaluation.accuracy / evaluation.precision / evaluation.recall / evaluation.f1 Accuracy regression check (Select mode only, in SelectResult.evaluation)

performance.ner_llm_timing / performance.select_llm_timing

Field What to check
mean_tokens_per_sec GPU throughput indicator
p50_tokens_per_sec Typical throughput
mean_load_duration_sec Warm-up effectiveness
max_load_duration_sec Worst-case model load
p50_latency_sec vs p99_latency_sec Tail latency (outlier detection)
total_prompt_tokens / total_eval_tokens Token budget

performance.stage_timings[]

Per-batch breakdown. Compare ner_sec, ontology_search_sec, text2term_sec, llm_select_sec to identify the bottleneck stage.

performance.disk_io

Field What to check
index_cache_load_sec Cache hit speed
index_cache_save_sec First-run overhead
resume_write_sec I/O growth over batches