# Testing

## Running Unit Tests
```bash
# Run all tests
uv run pytest

# Run a specific test file
uv run pytest tests/py_tests/test_utils.py

# Exclude slow tests
uv run pytest -m "not slow"

# Run with randomized order (enabled by pytest-randomly)
uv run pytest -p randomly
```
## Test Markers

| Marker | Description |
|---|---|
| `@pytest.mark.slow` | Long-running tests; excluded with `-m "not slow"` |
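A marked test might look like the following sketch; the test name and body are hypothetical, only the `@pytest.mark.slow` decorator is from this project's configuration.

```python
import time

import pytest


@pytest.mark.slow
def test_full_pipeline():  # hypothetical test name, for illustration only
    """A long-running test, deselected by `uv run pytest -m "not slow"`."""
    time.sleep(0.1)  # stand-in for expensive work
    assert True
```

Tests without the marker always run; marked ones run only when `-m "not slow"` is omitted.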
## Type Checking

Configured in `pyproject.toml` with strict mode and the pydantic plugin.
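Given the mention of strict mode and the pydantic plugin, the checker is presumably mypy; a minimal sketch of such a configuration follows. The actual `pyproject.toml` is authoritative and may set additional options.

```toml
# Hypothetical sketch of the strict setup described above.
[tool.mypy]
strict = true
plugins = ["pydantic.mypy"]
```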
## Linting and Formatting

```bash
# Lint
uv run ruff check bsllmner2/ tests/ scripts/

# Format
uv run ruff format bsllmner2/ tests/ scripts/

# Format check (for CI)
uv run ruff format --check bsllmner2/ tests/ scripts/
```
## Mutation Testing

`mutmut` validates that the test suite can detect injected code mutations. Target modules are configured in `pyproject.toml`.
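A sketch of what that configuration might look like, assuming a recent mutmut that reads `[tool.mutmut]` from `pyproject.toml`; the module list shown here is a guess, not the project's actual setting.

```toml
# Hypothetical mutmut target configuration; the real module list may differ.
[tool.mutmut]
paths_to_mutate = ["bsllmner2/"]
tests_dir = ["tests/"]
```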
## Model Evaluation

The `tests/model-evaluation/` directory contains scripts for benchmarking LLM models on ontology-mapping accuracy.
- **Datasets** (hosted on Zenodo): 600 BioSample entries evaluated against a human-curated gold standard. Data files are stored in `tests/data/` (see `tests/data/README.md` for details).
- **Evaluated models**: deepseek-r1 (8b/32b), gemma3 (4b/12b/27b), gpt-oss (20b), llama3.1 (8b), phi4 (14b), qwen3 (4b/8b/32b)
- **Metrics**: Precision, Recall, F1-score, Accuracy (for the `cell_line` field)
For the full evaluation workflow (batch execution, metric computation, result aggregation), see `tests/model-evaluation/README.md`.