Select Mode¶

A 3-stage pipeline that performs NER (like Extract mode) and then maps the extracted results to ontology terms.

Overview¶

Select mode internally runs Extract before performing ontology mapping. There is no need to run Extract separately.

BioSample JSON
      |
      v
+--------------------------------------------+
| Stage 1: NER Extraction                    |
| dynamic prompt + schema from SelectConfig  |
+--------------------------------------------+
      |
      v
+--------------------------------------------+
| Stage 2: Ontology Search                   |
| 2a. Word-combination search (index lookup) |
| 2b. text2term fallback (OWL only)          |
+--------------------------------------------+
      |
      v
+--------------------------------------------+
| Stage 3: LLM Selection                     |
| choose best term_id from candidates        |
+--------------------------------------------+
      |
      v
SelectResult

CLI Options¶

Common Options¶

Option	Description	Default
`--bs-entries`	Path to the input JSON or JSONL file containing BioSample entries (required)	--
`--model`	LLM model to use for NER	`llama3.1:70b`
`--thinking BOOL`	Enable or disable thinking mode for the LLM (`true`/`false`)	`false`
`--max-entries`	Process only the first N entries (`-1` for all)	`-1`
`--ollama-host`	Host URL for the Ollama server	`http://localhost:11434`
`--debug`	Enable debug mode for more verbose logging	`false`
`--run-name`	Name of the run for identification purposes	`{model}_{timestamp}`
`--resume`	Resume from the last incomplete run	`false`
`--batch-size`	Number of entries to process in each batch	`1024`
`--num-ctx`	Context length for Ollama	`4096`

Select-Specific Options¶

Option	Description	Default
`--mapping`	Path to the mapping file in TSV format (for evaluation)	`None`
`--select-config`	Path to the select configuration file in JSON format (required)	--
`--no-reasoning`	Disable reasoning step during selection	`false`

Usage Example¶

bsllmner2_select \
  --bs-entries tests/data/example_biosample.json \
  --model llama3.1:70b \
  --select-config scripts/select-config-hg38.json \
  --debug

Stage 1: NER Extraction¶

A prompt and a JSON Schema are generated dynamically from the SelectConfig field definitions, and entities are then extracted with the same ner() function used by Extract mode.

Prompt: build_extract_prompt_for_select() constructs the prompt from each field's prompt_description and value_type
Schema: build_extract_schema_for_select() generates a JSON Schema from field definitions ("string" -> ["string", "null"], "array" -> ["array", "null"])
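The type mapping described above can be sketched as follows. This is a minimal illustration, not the actual build_extract_schema_for_select() implementation: the function name sketch_extract_schema, the assumption that array items are strings, and the choice to mark every field as required are all hypothetical.

```python
from __future__ import annotations

from typing import Any

# Mapping stated in the text: each value_type becomes a nullable JSON Schema type.
NULLABLE_TYPES = {"string": ["string", "null"], "array": ["array", "null"]}


def sketch_extract_schema(fields: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical stand-in for build_extract_schema_for_select()."""
    properties: dict[str, Any] = {}
    for name, field in fields.items():
        # "string" is the documented default for value_type.
        value_type = field.get("value_type", "string")
        prop: dict[str, Any] = {"type": NULLABLE_TYPES[value_type]}
        if value_type == "array":
            # Assumption: array elements are plain strings.
            prop["items"] = {"type": "string"}
        properties[name] = prop
    # Assumption: every field is required but may be null.
    return {"type": "object", "properties": properties, "required": list(properties)}


schema = sketch_extract_schema(
    {
        "cell_line": {"value_type": "string"},
        "tissue": {"value_type": "array"},
    }
)
```

Making each type nullable lets the model return null for a field that is absent from the BioSample entry instead of inventing a value.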

Stage 2: Ontology Search¶

Stage 2 searches the ontology index for each extracted value. At run start-up, two caches are prepared once per process so that per-batch work is limited to lookups:

build_index_map() loads or rebuilds the word-combination OntologyIndex per ontology file (ontology/index_cache/)
build_text2term_cache() registers each OWL with text2term via text2term.cache_ontology(acronym=...) (ontology/text2term_cache/) so later map_terms() calls skip OWL parsing

Per batch, Stage 2 then runs:

Stage 2a: Word-combination search (ontology_search_sec). Indexes are built from pre-subsetted OWL files (via owlready2) or TSV/CSV files (term_id, prop_uri, value). The subsets are generated by scripts/build_subset_ontologies.sh (a thin wrapper around sh-ikeda/ontology-constructor-for-bsllmner SPARQL templates + ROBOT). Searches against rdfs:label, skos:prefLabel, and various synonym properties (oboInOwl:hasExactSynonym, etc.).
obo:IAO_0000115 (textual definition) is collected per term and surfaced as definitions on each candidate (not used for search/matching, but passed to the Stage 3 LLM as context)
rdfs:comment is surfaced as comments. In the default subset OWLs only ChEBI populates it (with has_role info injected upstream as "{role_type}: {role_label}"); other ontologies leave it empty
All species / hierarchy filtering is encoded at ontology build time (per-species Cellosaurus OWLs plus CL / UBERON / MONDO / ChEBI subsets); no runtime filter is applied
Exact match with a single term_id is finalized immediately; ambiguous or missing matches proceed to Stage 3
Stage 2b: text2term fallback (text2term_sec). For OWL files only, text2term.map_terms(..., target_ontology=<acronym>, use_cache=True, cache_folder=BSLLMNER2_TEXT2TERM_CACHE_DIR) is used as a similarity-based fallback. The acronym is {ontology_file_stem}_nofilter, matching the cache key used by the word-combination index. When the text2term cache build fails (e.g. read-only cache dir), the call falls back to target_ontology=<owl_path> with use_cache=False and the run continues with a warning
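Two of the rules above lend themselves to a short sketch: the Stage 2a decision (a single exact match is final, anything else goes to Stage 3) and the choice of text2term target depending on whether the cache build succeeded. The function names, the lowercase normalization, the plain-dict index, and the EX: term IDs are assumptions made for illustration; the real OntologyIndex and its matching rules may differ.

```python
from __future__ import annotations

from pathlib import Path


def resolve_exact(value: str, index: dict[str, set[str]]) -> tuple[str | None, list[str]]:
    """Return (final_term_id, candidates_for_stage_3) for one extracted value."""
    # Assumption: lookup keys are lowercased, whitespace-trimmed labels/synonyms.
    term_ids = sorted(index.get(value.strip().lower(), set()))
    if len(term_ids) == 1:
        # Exact match with a single term_id: finalized immediately.
        return term_ids[0], []
    # Ambiguous (several IDs) or missing (no IDs): defer to Stage 3.
    return None, term_ids


def text2term_target(owl_path: str, cache_ready: bool) -> dict[str, object]:
    """Pick the target_ontology / use_cache pair described for Stage 2b."""
    if cache_ready:
        # Acronym format from the text: {ontology_file_stem}_nofilter.
        return {"target_ontology": f"{Path(owl_path).stem}_nofilter", "use_cache": True}
    # Cache build failed: fall back to parsing the OWL file directly.
    return {"target_ontology": owl_path, "use_cache": False}


# Toy index with placeholder term IDs (not real ontology identifiers).
toy_index = {"hela": {"EX:0001"}, "liver": {"EX:0002", "EX:0003"}}

unique = resolve_exact("HeLa", toy_index)
ambiguous = resolve_exact("Liver", toy_index)
missing = resolve_exact("unknown tissue", toy_index)
```

The returned dictionary only models the two arguments that change between the cached and fallback paths; other arguments such as cache_folder are left out of the sketch.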

Stage 3: LLM Selection¶

For fields not resolved in Stage 2, candidates from ontology search and text2term are merged and presented to the LLM, which selects the best term_id. Selections run concurrently via asyncio.gather, with a Semaphore(256) bounding the number of in-flight requests.
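The asyncio.gather plus Semaphore pattern can be sketched as below. The names select_all and fake_select, the job dictionary shape, and the EX: term IDs are hypothetical; fake_select stands in for the real LLM call and simply returns the first candidate.

```python
from __future__ import annotations

import asyncio
from typing import Any, Awaitable, Callable

Job = dict[str, Any]


async def select_all(
    jobs: list[Job],
    select_one: Callable[[Job], Awaitable[str | None]],
    limit: int = 256,
) -> list[str | None]:
    """Run select_one over all jobs with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(job: Job) -> str | None:
        # Each task waits here until a semaphore slot is free.
        async with semaphore:
            return await select_one(job)

    # gather() returns results in the same order as the input jobs.
    return list(await asyncio.gather(*(bounded(job) for job in jobs)))


async def fake_select(job: Job) -> str | None:
    """Hypothetical stand-in for the LLM call: pick the first candidate."""
    await asyncio.sleep(0)
    return job["candidates"][0] if job["candidates"] else None


jobs = [
    {"field": "cell_line", "candidates": ["EX:0001", "EX:0002"]},
    {"field": "tissue", "candidates": []},
]
results = asyncio.run(select_all(jobs, fake_select, limit=2))
```

The semaphore caps concurrent requests to the model server while still letting gather schedule every field up front.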

When --no-reasoning is specified, the reasoning field is omitted from the output schema.
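As a rough sketch of what omitting the reasoning field could look like, the function below builds a selection output schema with or without it. The overall schema shape, the nullable term_id, and the function name are assumptions; only the term_id and reasoning field names come from the text above.

```python
from __future__ import annotations

from typing import Any


def sketch_selection_schema(include_reasoning: bool = True) -> dict[str, Any]:
    """Hypothetical Stage 3 output schema; reasoning is dropped on request."""
    # Assumption: term_id may be null when no candidate fits.
    properties: dict[str, Any] = {"term_id": {"type": ["string", "null"]}}
    if include_reasoning:
        properties["reasoning"] = {"type": "string"}
    return {"type": "object", "properties": properties, "required": list(properties)}


with_reasoning = sketch_selection_schema()
# Corresponds to running with --no-reasoning.
without_reasoning = sketch_selection_schema(include_reasoning=False)
```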

Select Config Customization¶

Select mode is configured via a JSON file (--select-config). Each field defines an extraction target with its ontology mapping. Pre-built configs are available in scripts/ (select-config-hg38.json, select-config-mm10.json, select-config-plants.json).

To create a custom config, define fields as follows:

{
  "fields": {
    "your_field_name": {
      "ontology_file": "/path/to/ontology.owl",
      "prompt_description": "Description of what to extract for NER prompt",
      "value_type": "string"
    }
  }
}

Field	Type	Default	Description
`ontology_file`	`string | null`	`null`	Ontology file path (.owl or .tsv/.csv). If null, the extracted value is used as-is without ontology mapping
`prompt_description`	`string | null`	`null`	Field description to include in the NER prompt
`value_type`	`"string" | "array"`	`"string"`	Extracted value type. `array` supports multiple values
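To check a custom config before a long run, the defaults in the table can be applied with a small loader such as the one below. FieldConfig and load_fields are hypothetical names used for illustration and are not part of bsllmner2; the validation rule only reflects the two value_type options listed in the table.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class FieldConfig:
    # Defaults mirror the table above.
    ontology_file: str | None = None
    prompt_description: str | None = None
    value_type: str = "string"


def load_fields(text: str) -> dict[str, FieldConfig]:
    """Parse a select-config JSON string and apply the documented defaults."""
    raw = json.loads(text)
    fields: dict[str, FieldConfig] = {}
    for name, spec in raw["fields"].items():
        # Unknown keys raise TypeError, which surfaces typos early.
        config = FieldConfig(**spec)
        if config.value_type not in ("string", "array"):
            raise ValueError(f"{name}: unsupported value_type {config.value_type!r}")
        fields[name] = config
    return fields


example = """
{
  "fields": {
    "cell_line": {
      "ontology_file": "/path/to/ontology.owl",
      "prompt_description": "Cell line name"
    },
    "note": {}
  }
}
"""
fields = load_fields(example)
```

Here the note field has no ontology_file, so under the rules in the table its extracted value would be kept as-is without ontology mapping.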

Resume¶

When --resume is specified, processing continues from the point where the previous run was interrupted. Resume files are deleted automatically after successful completion.

Specify the same --run-name as the original run. If the original run used the auto-generated name ({model}_{timestamp}), look it up from the resume file in bsllmner2-results/select/.

Result Files¶

See Data Formats for the full result schema.

File	Description
`bsllmner2-results/select/select_{run_name}.json`	Select result (contains both extract and select entries)