Extract Mode¶

Performs Named Entity Recognition (NER) on BioSample records using an LLM to extract biological information in specified categories.

Overview¶

BioSample JSON
      |
      v
+------------+
| Load       |
| bs_entries |
+------------+
      |
      v
+------------+
| Build      |
| messages   |
+------------+
      |
      v
+------------+
| Ollama     |
| chat()     |
+------------+
      |
      v
+------------+
| Parse JSON |
| response   |
+------------+

Load BioSample entries from JSON/JSONL
Apply prompt (YAML) and format schema (JSON Schema)
Send batch requests to Ollama
Extract and parse JSON from responses

CLI Options¶

Common Options¶

Option	Description	Default
`--bs-entries`	Path to the input JSON or JSONL file containing BioSample entries (required)	--
`--model`	LLM model to use for NER	`llama3.1:70b`
`--thinking BOOL`	Enable or disable thinking mode for the LLM (`true`/`false`)	`false`
`--max-entries`	Process only the first N entries (`-1` for all)	`-1`
`--ollama-host`	Host URL for the Ollama server	`http://localhost:11434`
`--debug`	Enable debug mode for more verbose logging	`false`
`--run-name`	Name of the run for identification purposes	`{model}_{timestamp}`
`--resume`	Resume from the last incomplete run	`false`
`--batch-size`	Number of entries to process in each batch	`1024`
`--num-ctx`	Context length for Ollama	`4096`

Extract-Specific Options¶

Option	Description	Default
`--prompt`	Path to the prompt file in YAML format	`bsllmner2/prompt/prompt_extract.yml`
`--format`	Path to the JSON schema file for the output format	`None`

Usage Examples¶

bsllmner2_extract \
  --bs-entries tests/data/example_biosample.json \
  --prompt bsllmner2/prompt/prompt_extract.yml \
  --format bsllmner2/format/cell_line.schema.json \
  --model llama3.1:70b \
  --debug

With Docker:

docker compose exec app bsllmner2_extract \
  --bs-entries tests/data/example_biosample.json \
  --model llama3.1:70b

Prompt Specification¶

Prompts are defined as a YAML list where each element has role and content.

- role: system
  content: |-
    You are a smart curator of biological data
- role: user
  content: |-
    I will input JSON formatted metadata of a sample...
    Here is the input metadata:

At runtime, the BioSample entry JSON is appended to the content of the last message.

Select-Mode NER Prompt (`build_extract_prompt_for_select`)¶

When extraction runs as the first stage of Select mode (driven by the select config, not a standalone prompt YAML), the prompt is synthesized in code by bsllmner2.pipeline.build_extract_prompt_for_select(). Its user message includes two rule blocks:

Output rules — JSON-only output, per-field value-type handling, prefer exact mentions, avoid hallucination
Category assignment rules — domain-agnostic boundaries that mitigate cross-field leaks observed in large-scale runs:
Each extracted value belongs to at most one category; ambiguous values must pick the single most appropriate one by biological meaning
Values are classified by biological meaning, not by the attribute key/label in the input (e.g., if an attribute labeled drug actually contains HeLa, it belongs in cell_line)
Experimental control terms (negative control, NC, vehicle, mock, empty vector, scramble, non-targeting, shControl, siControl, …) are not extracted into any category — they are experimental conditions, not biological entities

These rules are intentionally generic (no ontology- or field-specific guidance) so bsllmner stays applicable to arbitrary select configs.

Customization¶

Copy the built-in prompt (bsllmner2/prompt/prompt_extract.yml)
Edit the category descriptions and output rules
Specify the file with --prompt

Output Format Specification¶

When a JSON Schema is specified with --format, the Ollama structured output feature (format parameter) constrains the output format.

Built-in schema (bsllmner2/format/cell_line.schema.json):

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "cell_line": { "type": ["string", "null"] }
  },
  "required": ["cell_line"],
  "additionalProperties": true
}

If --format is omitted, the LLM responds in free form. The last JSON object/array is extracted from the response using a regex and parsed.

Resume¶

When --resume is specified, processing continues from the previous interruption. The resume file is automatically deleted after successful completion.

The same --run-name must be specified as the original run. If the original run used the auto-generated name ({model}_{timestamp}), you need to find it from the resume file in bsllmner2-results/extract/.

Result Files¶

See Data Formats for the full result schema.

File	Description
`bsllmner2-results/extract/{run_name}.json`	Complete result
`bsllmner2-results/extract/{run_name}_resume.json`	Resume intermediate file (during processing only)

The default run_name is {model}_{YYYYMMDD_HHMMSS} (UTC).