Select Mode
A 3-stage pipeline that performs NER (like Extract mode) and then maps the extracted results to ontology terms.
Overview
Select mode internally runs Extract before performing ontology mapping. There is no need to run Extract separately.
                BioSample JSON
                      |
                      v
+--------------------------------------------+
| Stage 1: NER Extraction                    |
| dynamic prompt + schema from SelectConfig  |
+--------------------------------------------+
                      |
                      v
+--------------------------------------------+
| Stage 2: Ontology Search                   |
| 2a. Word-combination search (index lookup) |
| 2b. text2term fallback (OWL only)          |
+--------------------------------------------+
                      |
                      v
+--------------------------------------------+
| Stage 3: LLM Selection                     |
| choose best term_id from candidates        |
+--------------------------------------------+
                      |
                      v
                 SelectResult
CLI Options
Common Options
| Option | Description | Default |
|---|---|---|
| `--bs-entries` | Path to the input JSON or JSONL file containing BioSample entries (required) | -- |
| `--model` | LLM model to use for NER | `llama3.1:70b` |
| `--thinking BOOL` | Enable or disable thinking mode for the LLM (`true`/`false`) | `false` |
| `--max-entries` | Process only the first N entries (`-1` for all) | `-1` |
| `--ollama-host` | Host URL for the Ollama server | `http://localhost:11434` |
| `--debug` | Enable debug mode for more verbose logging | `false` |
| `--run-name` | Name of the run for identification purposes | `{model}_{timestamp}` |
| `--resume` | Resume from the last incomplete run | `false` |
| `--batch-size` | Number of entries to process in each batch | `1024` |
| `--num-ctx` | Context length for Ollama | `4096` |
Select-Specific Options
| Option | Description | Default |
|---|---|---|
| `--mapping` | Path to the mapping file in TSV format (for evaluation) | None |
| `--select-config` | Path to the select configuration file in JSON format (required) | -- |
| `--no-reasoning` | Disable reasoning step during selection | `false` |
Usage Example
bsllmner2_select \
  --bs-entries tests/data/example_biosample.json \
  --model llama3.1:70b \
  --select-config scripts/select-config-hg38.json \
  --debug
Stage 1: NER Extraction
SelectConfig field definitions are used to dynamically generate a prompt and JSON Schema, then extract entities using the same ner() function as Extract mode.
- Prompt: `build_extract_prompt_for_select()` constructs the prompt from each field's `prompt_description` and `value_type`
- Schema: `build_extract_schema_for_select()` generates a JSON Schema from field definitions (`"string"` -> `["string", "null"]`, `"array"` -> `["array", "null"]`)
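The type mapping described above can be illustrated with a minimal sketch. The function `sketch_extract_schema` is a hypothetical stand-in, not the actual `build_extract_schema_for_select()` implementation; the `items`, `required`, and `additionalProperties` entries are assumptions that the text does not confirm, and the example field names are made up.

```python
from typing import Any

# Mapping stated in the text: every extracted field is nullable.
JSON_TYPES = {"string": ["string", "null"], "array": ["array", "null"]}


def sketch_extract_schema(fields: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical stand-in for build_extract_schema_for_select()."""
    properties: dict[str, Any] = {}
    for name, field in fields.items():
        # "string" is the documented default for value_type.
        value_type = field.get("value_type") or "string"
        prop: dict[str, Any] = {"type": JSON_TYPES[value_type]}
        if value_type == "array":
            # Assumption: array elements are plain strings.
            prop["items"] = {"type": "string"}
        properties[name] = prop
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),   # assumption, not confirmed by the text
        "additionalProperties": False,  # assumption, not confirmed by the text
    }


# Made-up field definitions, for illustration only.
example_fields = {
    "cell_line": {"prompt_description": "Cell line name", "value_type": "string"},
    "drugs": {"prompt_description": "Treatments applied", "value_type": "array"},
}
schema = sketch_extract_schema(example_fields)
```

Allowing `null` for every field lets the model report "not present" explicitly instead of inventing a value.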
Stage 2: Ontology Search
Searches the ontology index for extracted values. At run start-up, two caches are prepared once per process so that per-batch work is kept to lookups only:
- `build_index_map()` loads or rebuilds the word-combination `OntologyIndex` per ontology file (`ontology/index_cache/`)
- `build_text2term_cache()` registers each OWL with text2term via `text2term.cache_ontology(acronym=...)` (`ontology/text2term_cache/`) so later `map_terms()` calls skip OWL parsing
Per batch, Stage 2 then runs:
- Stage 2a: Word-combination search (`ontology_search_sec`). Indexes are built from pre-subsetted OWL files (via owlready2) or TSV/CSV files (`term_id`, `prop_uri`, `value`). The subsets are generated by `scripts/build_subset_ontologies.sh` (a thin wrapper around `sh-ikeda/ontology-constructor-for-bsllmner` SPARQL templates + ROBOT). Searches against `rdfs:label`, `skos:prefLabel`, and various synonym properties (`oboInOwl:hasExactSynonym`, etc.).
  - `obo:IAO_0000115` (textual definition) is collected per term and surfaced as `definitions` on each candidate (not used for search/matching, but passed to the Stage 3 LLM as context)
  - `rdfs:comment` is surfaced as `comments`. In the default subset OWLs only ChEBI populates it (with `has_role` info injected upstream as `"{role_type}: {role_label}"`); other ontologies leave it empty
  - All species / hierarchy filtering is encoded at ontology build time (per-species Cellosaurus OWLs plus CL / UBERON / MONDO / ChEBI subsets); no runtime filter is applied
  - Exact match with a single term_id is finalized immediately; ambiguous or missing matches proceed to Stage 3
- Stage 2b: text2term fallback (`text2term_sec`). For OWL files only, `text2term.map_terms(..., target_ontology=<acronym>, use_cache=True, cache_folder=BSLLMNER2_TEXT2TERM_CACHE_DIR)` is used as a similarity-based fallback. The acronym is `{ontology_file_stem}_nofilter`, matching the cache key used by the word-combination index. When the text2term cache build fails (e.g. read-only cache dir), the call falls back to `target_ontology=<owl_path>` with `use_cache=False` and the run continues with a warning
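The Stage 2a decision rule and the Stage 2b acronym format can be sketched as follows. Both function names are hypothetical, `cl_subset.owl` and the `TERM:` identifiers are made-up placeholders, and treating repeated hits on the same term_id (for example via a label and a synonym) as a single match is an assumption.

```python
from pathlib import Path
from typing import Optional


def text2term_acronym(ontology_file: str) -> str:
    """Acronym format stated in the text: {ontology_file_stem}_nofilter."""
    return f"{Path(ontology_file).stem}_nofilter"


def triage_exact_matches(exact_term_ids: list[str]) -> Optional[str]:
    """Return the term_id to finalize, or None to defer the field to Stage 3.

    Assumption: several hits on the same term_id count as one match.
    """
    unique_ids = set(exact_term_ids)
    if len(unique_ids) == 1:
        return next(iter(unique_ids))  # single exact match: finalize now
    return None  # missing (0 ids) or ambiguous (2+ ids): Stage 3 decides


acronym = text2term_acronym("ontology/cl_subset.owl")
finalized = triage_exact_matches(["TERM:0001", "TERM:0001"])
ambiguous = triage_exact_matches(["TERM:0001", "TERM:0002"])
missing = triage_exact_matches([])
```

Finalizing unambiguous exact matches early keeps the number of Stage 3 LLM calls down to the fields that genuinely need a judgment.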
Stage 3: LLM Selection
For fields not resolved in Stage 2, candidates from ontology search and text2term are merged and presented to the LLM, which selects the best term_id. Selection calls run concurrently via asyncio.gather, with a Semaphore(256) capping the number of calls in flight.
When --no-reasoning is specified, the reasoning field is omitted from the output schema.
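The concurrency pattern named above (`asyncio.gather` bounded by a semaphore) looks roughly like the sketch below. `select_term` is a hypothetical placeholder for one LLM selection call and simply returns the first candidate; the `TERM:` identifiers are made up, and only the limit of 256 comes from the text.

```python
import asyncio

MAX_CONCURRENCY = 256  # limit stated in the text


async def select_term(field: str, candidates: list[str]) -> tuple[str, str]:
    """Hypothetical placeholder for one LLM selection call."""
    await asyncio.sleep(0)  # stands in for the request round trip
    return field, candidates[0]


async def select_all(pending: dict[str, list[str]]) -> dict[str, str]:
    # Create the semaphore inside the running event loop.
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(field: str, candidates: list[str]) -> tuple[str, str]:
        async with semaphore:  # at most MAX_CONCURRENCY calls in flight
            return await select_term(field, candidates)

    # gather schedules every call at once; the semaphore throttles them.
    pairs = await asyncio.gather(*(bounded(f, c) for f, c in pending.items()))
    return dict(pairs)


selected = asyncio.run(
    select_all({"cell_line": ["TERM:0001", "TERM:0002"], "tissue": ["TERM:0003"]})
)
```

The semaphore lets all pending fields be submitted in one `gather` call while still protecting the LLM server from an unbounded burst of requests.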
Select Config Customization
Select mode is configured via a JSON file (--select-config). Each field defines an extraction target with its ontology mapping. Pre-built configs are available in scripts/ (select-config-hg38.json, select-config-mm10.json, select-config-plants.json).
To create a custom config, define fields as follows:
{
  "fields": {
    "your_field_name": {
      "ontology_file": "/path/to/ontology.owl",
      "prompt_description": "Description of what to extract for NER prompt",
      "value_type": "string"
    }
  }
}
| Field | Type | Default | Description |
|---|---|---|---|
| `ontology_file` | `string \| null` | `null` | Ontology file path (`.owl` or `.tsv`/`.csv`). If `null`, uses the extracted value as-is without ontology mapping |
| `prompt_description` | `string \| null` | `null` | Field description to include in the NER prompt |
| `value_type` | `"string" \| "array"` | `"string"` | Extracted value type. `array` supports multiple values |
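To check a custom config before a run, the defaults in the table above can be mirrored in a few lines of Python. This is only a sketch: `FieldSketch` and `parse_select_config` are hypothetical names, not part of bsllmner2, and the field names in the sample config are made up.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSketch:
    """Hypothetical mirror of one entry under "fields", with documented defaults."""

    ontology_file: Optional[str] = None       # None: keep the extracted value as-is
    prompt_description: Optional[str] = None  # text added to the NER prompt
    value_type: str = "string"                # "string" or "array"

    def __post_init__(self) -> None:
        if self.value_type not in ("string", "array"):
            raise ValueError(f"unsupported value_type: {self.value_type!r}")


def parse_select_config(text: str) -> dict[str, FieldSketch]:
    """Parse a select-config JSON document into per-field settings."""
    raw = json.loads(text)
    return {name: FieldSketch(**spec) for name, spec in raw["fields"].items()}


config_text = """
{
  "fields": {
    "cell_line": {
      "ontology_file": "/path/to/ontology.owl",
      "prompt_description": "Cell line name"
    },
    "age": {"prompt_description": "Age of the donor"}
  }
}
"""
fields = parse_select_config(config_text)
```

Here `cell_line` falls back to the default `value_type` of `"string"`, and `age` has no `ontology_file`, so its extracted value would pass through without ontology mapping.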
Resume
When --resume is specified, processing continues from the previous interruption. Resume files are automatically deleted after successful completion.
Specify the same --run-name as the original run. If the original run used the auto-generated name ({model}_{timestamp}), look it up from the resume file in bsllmner2-results/select/.
Result Files
See Data Formats for the full result schema.
| File | Description |
|---|---|
| `bsllmner2-results/select/select_{run_name}.json` | Select result (contains both extract and select entries) |