
Extract Mode

Performs Named Entity Recognition (NER) on BioSample records using an LLM to extract biological information in specified categories.

Overview

BioSample JSON
      |
      v
+------------+
| Load       |
| bs_entries |
+------------+
      |
      v
+------------+
| Build      |
| messages   |
+------------+
      |
      v
+------------+
| Ollama     |
| chat()     |
+------------+
      |
      v
+------------+
| Parse JSON |
| response   |
+------------+
  1. Load BioSample entries from JSON/JSONL
  2. Apply prompt (YAML) and format schema (JSON Schema)
  3. Send batch requests to Ollama
  4. Extract and parse JSON from responses
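
Step 1 can be sketched as follows. This is a minimal illustration, not the loader that bsllmner2 ships: the function name `load_bs_entries` is hypothetical, and both choosing the parser from the `.jsonl` extension and wrapping a single JSON object in a list are assumptions.

```python
import json
from pathlib import Path


def load_bs_entries(path):
    """Load BioSample entries from a JSON or JSONL file (hypothetical helper)."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # JSONL: one JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    # Assumption: a plain JSON file holds either a list of entries or one entry.
    return data if isinstance(data, list) else [data]
```

Either input form yields a plain list of dictionaries, so the later steps do not need to know which file format was used.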

CLI Options

Common Options

| Option | Description | Default |
|---|---|---|
| --bs-entries | Path to the input JSON or JSONL file containing BioSample entries (required) | -- |
| --model | LLM model to use for NER | llama3.1:70b |
| --thinking BOOL | Enable or disable thinking mode for the LLM (true/false) | false |
| --max-entries | Process only the first N entries (-1 for all) | -1 |
| --ollama-host | Host URL for the Ollama server | http://localhost:11434 |
| --debug | Enable debug mode for more verbose logging | false |
| --run-name | Name of the run for identification purposes | {model}_{timestamp} |
| --resume | Resume from the last incomplete run | false |
| --batch-size | Number of entries to process in each batch | 1024 |
| --num-ctx | Context length for Ollama | 4096 |
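
To make the interaction of --max-entries and --batch-size concrete, here is a small sketch. The helper names `select_entries` and `iter_batches` are hypothetical and only mirror the documented semantics (first N entries, -1 for all, fixed-size batches). The real implementation may differ.

```python
def select_entries(entries, max_entries=-1):
    """Keep the first N entries; -1 keeps all (mirrors --max-entries)."""
    return entries if max_entries < 0 else entries[:max_entries]


def iter_batches(entries, batch_size=1024):
    """Yield consecutive slices of at most batch_size entries (mirrors --batch-size)."""
    for start in range(0, len(entries), batch_size):
        yield entries[start:start + batch_size]
```

With five entries and a batch size of 2, `iter_batches` yields batches of 2, 2, and 1 entries. The last batch is simply shorter.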

Extract-Specific Options

| Option | Description | Default |
|---|---|---|
| --prompt | Path to the prompt file in YAML format | bsllmner2/prompt/prompt_extract.yml |
| --format | Path to the JSON schema file for the output format | None |

Usage Examples

bsllmner2_extract \
  --bs-entries tests/data/example_biosample.json \
  --prompt bsllmner2/prompt/prompt_extract.yml \
  --format bsllmner2/format/cell_line.schema.json \
  --model llama3.1:70b \
  --debug

With Docker:

docker compose exec app bsllmner2_extract \
  --bs-entries tests/data/example_biosample.json \
  --model llama3.1:70b

Prompt Specification

Prompts are defined as a YAML list where each element has role and content.

- role: system
  content: |-
    You are a smart curator of biological data
- role: user
  content: |-
    I will input JSON formatted metadata of a sample...
    Here is the input metadata:

At runtime, the BioSample entry JSON is appended to the content of the last message.
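
The append step can be pictured roughly as below. This is a sketch under assumptions: the function name `build_messages` is hypothetical, the prompt is taken as an already-parsed list of `role`/`content` dictionaries, and the newline separator and JSON indentation are illustrative choices rather than documented behavior.

```python
import copy
import json


def build_messages(prompt, bs_entry):
    """Return a copy of the prompt with the entry JSON appended to the last message."""
    # Copy first so the shared prompt stays unchanged between entries.
    messages = copy.deepcopy(prompt)
    entry_json = json.dumps(bs_entry, ensure_ascii=False, indent=2)
    # Assumption: a newline separates the prompt text from the JSON.
    messages[-1]["content"] = messages[-1]["content"] + "\n" + entry_json
    return messages
```

Because only the last message is modified, the system message and any earlier messages are sent to the model exactly as written in the YAML file.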

Select-Mode NER Prompt (build_extract_prompt_for_select)

When extraction runs as the first stage of Select mode (driven by the select config, not a standalone prompt YAML), the prompt is synthesized in code by bsllmner2.pipeline.build_extract_prompt_for_select(). Its user message includes two rule blocks:

  • Output rules: JSON-only output, per-field value-type handling, prefer exact mentions, avoid hallucination
  • Category assignment rules: domain-agnostic boundaries that mitigate cross-field leaks observed in large-scale runs:
      • Each extracted value belongs to at most one category; ambiguous values must pick the single most appropriate one by biological meaning
      • Values are classified by biological meaning, not by the attribute key/label in the input (e.g., if an attribute labeled drug actually contains HeLa, it belongs in cell_line)
      • Experimental control terms (negative control, NC, vehicle, mock, empty vector, scramble, non-targeting, shControl, siControl, …) are not extracted into any category, because they are experimental conditions, not biological entities

These rules are intentionally generic (no ontology- or field-specific guidance) so bsllmner stays applicable to arbitrary select configs.

Customization

  1. Copy the built-in prompt (bsllmner2/prompt/prompt_extract.yml)
  2. Edit the category descriptions and output rules
  3. Specify the file with --prompt

Output Format Specification

When a JSON Schema is specified with --format, the Ollama structured output feature (format parameter) constrains the output format.

Built-in schema (bsllmner2/format/cell_line.schema.json):

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "cell_line": { "type": ["string", "null"] }
  },
  "required": ["cell_line"],
  "additionalProperties": true
}
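
For orientation, the sketch below shows how a schema like this could be placed in the request body of Ollama's `/api/chat` endpoint, which accepts a JSON Schema in its `format` field and the context length under `options.num_ctx`. How bsllmner2 builds its request internally is not shown in this document, so the helper name `build_chat_payload` is hypothetical and the code is an illustration only.

```python
CELL_LINE_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {"cell_line": {"type": ["string", "null"]}},
    "required": ["cell_line"],
    "additionalProperties": True,
}


def build_chat_payload(model, messages, schema=None, num_ctx=4096):
    """Build a request body for Ollama's /api/chat (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,  # ask for one complete response instead of chunks
        "options": {"num_ctx": num_ctx},
    }
    if schema is not None:
        # Structured output: the response is constrained to this JSON Schema.
        payload["format"] = schema
    return payload
```

Omitting `schema` leaves the `format` field out of the body entirely.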

If --format is omitted, the LLM responds in free form. The last JSON object/array is extracted from the response using a regex and parsed.
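
One way to pull the last JSON value out of a free-form response is sketched below. It is not the tool's actual pattern: here a regex only locates candidate start positions (`{` or `[`), and `json.JSONDecoder.raw_decode` does the parsing, which is a swapped-in technique chosen to handle nested braces. The function name `extract_last_json` is hypothetical.

```python
import json
import re


def extract_last_json(text):
    """Return the last top-level JSON object/array found in text, or None."""
    decoder = json.JSONDecoder()
    last = None
    consumed = 0  # end offset of the most recently parsed value
    for match in re.finditer(r"[\[{]", text):
        if match.start() < consumed:
            continue  # this bracket is nested inside a value already parsed
        try:
            value, end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue  # not valid JSON from here; try the next candidate
        last = value
        consumed = end
    return last
```

Taking the last value rather than the first helps when a model echoes the input metadata or drafts an answer before giving its final JSON.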

Resume

When --resume is specified, processing continues from the point where the previous run was interrupted. The resume file is deleted automatically after the run completes successfully.

Pass the same --run-name as the original run. If the original run used the auto-generated name ({model}_{timestamp}), look it up from the name of the resume file in bsllmner2-results/extract/.
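
If several interrupted runs exist, a short script can list their names. The sketch below relies only on the `{run_name}_resume.json` naming described under Result Files. The helper name `find_resumable_runs` is hypothetical and not part of bsllmner2.

```python
from pathlib import Path

RESUME_SUFFIX = "_resume.json"


def find_resumable_runs(results_dir="bsllmner2-results/extract"):
    """List run names that still have a resume file in results_dir."""
    names = []
    for path in sorted(Path(results_dir).glob("*" + RESUME_SUFFIX)):
        # Strip the suffix to recover the value to pass as --run-name.
        names.append(path.name[: -len(RESUME_SUFFIX)])
    return names
```

Each returned name can be passed back with --run-name together with --resume.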

Result Files

See Data Formats for the full result schema.

| File | Description |
|---|---|
| bsllmner2-results/extract/{run_name}.json | Complete result |
| bsllmner2-results/extract/{run_name}_resume.json | Resume intermediate file (during processing only) |

The default run_name is {model}_{YYYYMMDD_HHMMSS} (UTC).
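
As an illustration of that pattern, the following sketch formats a UTC timestamp as YYYYMMDD_HHMMSS and joins it to the model name. The function name `default_run_name` is hypothetical. Whether the tool alters characters in the model name (such as the `:` in llama3.1:70b) is not stated here, so the sketch leaves the name untouched.

```python
from datetime import datetime, timezone


def default_run_name(model, now=None):
    """Build a run name of the form {model}_{YYYYMMDD_HHMMSS} in UTC."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Convert to UTC in case a timezone-aware datetime in another zone is given.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{model}_{stamp}"
```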