Select Mode

A 3-stage pipeline that performs NER (like Extract mode) and then maps the extracted results to ontology terms.

Overview

Select mode internally runs Extract before performing ontology mapping. There is no need to run Extract separately.

BioSample JSON
      |
      v
+--------------------------------------------+
| Stage 1: NER Extraction                    |
| dynamic prompt + schema from SelectConfig  |
+--------------------------------------------+
      |
      v
+--------------------------------------------+
| Stage 2: Ontology Search                   |
| 2a. Word-combination search (index lookup) |
| 2b. text2term fallback (OWL only)          |
+--------------------------------------------+
      |
      v
+--------------------------------------------+
| Stage 3: LLM Selection                     |
| choose best term_id from candidates        |
+--------------------------------------------+
      |
      v
SelectResult

CLI Options

Common Options

| Option | Description | Default |
| --- | --- | --- |
| --bs-entries | Path to the input JSON or JSONL file containing BioSample entries (required) | -- |
| --model | LLM model to use for NER | llama3.1:70b |
| --thinking BOOL | Enable or disable thinking mode for the LLM (true/false) | false |
| --max-entries | Process only the first N entries (-1 for all) | -1 |
| --ollama-host | Host URL for the Ollama server | http://localhost:11434 |
| --debug | Enable debug mode for more verbose logging | false |
| --run-name | Name of the run for identification purposes | {model}_{timestamp} |
| --resume | Resume from the last incomplete run | false |
| --batch-size | Number of entries to process in each batch | 1024 |
| --num-ctx | Context length for Ollama | 4096 |

Select-Specific Options

| Option | Description | Default |
| --- | --- | --- |
| --mapping | Path to the mapping file in TSV format (for evaluation) | None |
| --select-config | Path to the select configuration file in JSON format (required) | -- |
| --no-reasoning | Disable reasoning step during selection | false |

Usage Example

```shell
bsllmner2_select \
  --bs-entries tests/data/example_biosample.json \
  --model llama3.1:70b \
  --select-config scripts/select-config.json \
  --debug
```

Stage 1: NER Extraction

SelectConfig field definitions are used to dynamically generate a prompt and a JSON Schema; entities are then extracted using the same ner() function as Extract mode.

  • Prompt: build_extract_prompt_for_select() constructs the prompt from each field's prompt_description and value_type
  • Schema: build_extract_schema_for_select() generates a JSON Schema from field definitions ("string" -> ["string", "null"], "array" -> ["array", "null"])
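The schema-generation rule can be sketched as follows. This is a minimal illustration of the nullable-type mapping described above, not the actual build_extract_schema_for_select() implementation; the function and field names here are hypothetical:

```python
# Sketch of Stage 1 schema generation: each SelectConfig field becomes a
# nullable JSON Schema property ("string" -> ["string", "null"],
# "array" -> ["array", "null"]). Illustration only, not the real code.
from typing import Any

def build_schema_sketch(fields: dict[str, dict[str, Any]]) -> dict[str, Any]:
    properties: dict[str, Any] = {}
    for name, field in fields.items():
        value_type = field.get("value_type", "string")
        if value_type == "array":
            # Arrays of strings, nullable when nothing was extracted
            properties[name] = {"type": ["array", "null"],
                                "items": {"type": "string"}}
        else:
            # Plain strings, nullable when nothing was extracted
            properties[name] = {"type": ["string", "null"]}
    return {"type": "object",
            "properties": properties,
            "required": list(fields)}

schema = build_schema_sketch({
    "organism": {"value_type": "string"},
    "tissues": {"value_type": "array"},
})
```

Making every property nullable lets the LLM explicitly return null for fields absent from the BioSample entry instead of hallucinating a value.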

Stage 2: Ontology Search

Searches the ontology index for the values extracted in Stage 1.

  • Indexes are built from OWL files (via owlready2) or TSV/CSV files (term_id, prop_uri, value)
  • Searches against rdfs:label, skos:prefLabel, and various synonym properties (oboInOwl:hasExactSynonym, etc.)
  • rdfs:comment is also extracted from OWL files as term-level metadata (not used for search/matching, but included in candidate info for Stage 3 LLM context)
  • ontology_filter can restrict entries (e.g., {"hasDbXref": "NCBI_TaxID:9606"} for human only)
  • Exact match with a single term_id is finalized immediately; ambiguous or missing matches proceed to Stage 3
  • For OWL files, text2term.map_terms() is used as a similarity-based fallback
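The exact-match shortcut described above can be illustrated with a toy inverted index. This is a hypothetical sketch (with made-up term IDs), not the real index, which also covers skos:prefLabel and synonym properties:

```python
# Toy illustration of the Stage 2 exact-match rule: normalized label text
# maps to a set of term IDs. A unique hit is finalized immediately;
# ambiguous or missing hits fall through to Stage 3. IDs are made up.
from collections import defaultdict

index: dict[str, set[str]] = defaultdict(set)
for term_id, label in [
    ("EX:0001", "liver"),
    ("EX:0002", "lung"),
    ("EX:0003", "cell"),
    ("EX:0004", "cell"),   # deliberately ambiguous label
]:
    index[label.lower()].add(term_id)

def lookup(value: str):
    """Return (term_id, None) if uniquely resolved, else (None, candidates)."""
    hits = index.get(value.strip().lower(), set())
    if len(hits) == 1:
        return next(iter(hits)), None   # finalized immediately
    return None, sorted(hits)           # ambiguous/missing -> Stage 3
```

Only the ambiguous and missing cases incur LLM cost in Stage 3, which is why the index lookup runs first.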

Stage 3: LLM Selection

For fields not resolved in Stage 2, candidates from the ontology search and text2term are merged and presented to the LLM, which selects the best term_id. Selection runs in parallel via asyncio.gather with a Semaphore(256) concurrency cap.
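The concurrency pattern can be sketched as follows, with a placeholder coroutine standing in for the real LLM call (function names and data are hypothetical):

```python
# Sketch of the Stage 3 concurrency pattern: asyncio.gather fans out one
# selection task per unresolved field, and a Semaphore caps in-flight
# LLM calls at 256. The selection logic itself is a placeholder.
import asyncio
from typing import Optional

SEM = asyncio.Semaphore(256)  # cap concurrent LLM selection calls

async def select_term(field: str, candidates: list[str]):
    async with SEM:
        await asyncio.sleep(0)  # stands in for the LLM round trip
        # Placeholder choice: real code asks the LLM for the best term_id
        best: Optional[str] = candidates[0] if candidates else None
        return field, best

async def run(unresolved: dict[str, list[str]]) -> dict:
    tasks = [select_term(f, c) for f, c in unresolved.items()]
    return dict(await asyncio.gather(*tasks))

results = asyncio.run(run({"tissue": ["EX:0001", "EX:0002"], "strain": []}))
```

The semaphore keeps the Ollama backend from being flooded while still overlapping many slow LLM round trips.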

When --no-reasoning is specified, the reasoning field is omitted from the output schema.

Select Config Customization

Select mode is configured via a JSON file (--select-config). Each field defines an extraction target with its ontology mapping. Several pre-built configs are available in scripts/ (e.g., select-config.json, select-config-hg38.json, select-config-mm10.json).

To create a custom config, define fields as follows:

```json
{
  "fields": {
    "your_field_name": {
      "ontology_file": "/path/to/ontology.owl",
      "prompt_description": "Description of what to extract for NER prompt",
      "ontology_filter": { "hasDbXref": "NCBI_TaxID:9606" },
      "value_type": "string"
    }
  }
}
```
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| ontology_file | string \| null | null | Ontology file path (.owl or .tsv/.csv). If null, the extracted value is used as-is without ontology mapping |
| prompt_description | string \| null | null | Field description to include in the NER prompt |
| ontology_filter | Dict[str, str] \| null | null | Filter condition for OWL entries |
| value_type | "string" \| "array" | "string" | Extracted value type. array supports multiple values |
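For example, a field that collects multiple values and skips ontology mapping entirely (ontology_file set to null) could look like this. The field name and description are hypothetical:

```json
{
  "fields": {
    "treatments": {
      "ontology_file": null,
      "prompt_description": "All chemical or physical treatments applied to the sample",
      "ontology_filter": null,
      "value_type": "array"
    }
  }
}
```

Because ontology_file is null, the extracted values pass through to the result unchanged, and Stages 2 and 3 are skipped for this field.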

Resume

When --resume is specified, processing continues from the previous interruption. Resume files are automatically deleted after successful completion.

The same --run-name as the original run must be specified. If the original run used the auto-generated name ({model}_{timestamp}), you can recover it from the resume file name in bsllmner2-results/select/.

Result Files

See Data Formats for the full result schema.

| File | Description |
| --- | --- |
| bsllmner2-results/select/select_{run_name}.json | Select result (contains both extract and select entries) |