Extract Mode¶
Performs Named Entity Recognition (NER) on BioSample records using an LLM to extract biological information in specified categories.
Overview¶
BioSample JSON
|
v
+------------+
| Load |
| bs_entries |
+------------+
|
v
+------------+
| Build |
| messages |
+------------+
|
v
+------------+
| Ollama |
| chat() |
+------------+
|
v
+------------+
| Parse JSON |
| response |
+------------+
- Load BioSample entries from JSON/JSONL
- Apply prompt (YAML) and format schema (JSON Schema)
- Send batch requests to Ollama
- Extract and parse JSON from responses
CLI Options¶
Common Options¶
| Option | Description | Default |
|---|---|---|
--bs-entries |
Path to the input JSON or JSONL file containing BioSample entries (required) | -- |
--model |
LLM model to use for NER | llama3.1:70b |
--thinking BOOL |
Enable or disable thinking mode for the LLM (true/false) |
false |
--max-entries |
Process only the first N entries (-1 for all) |
-1 |
--ollama-host |
Host URL for the Ollama server | http://localhost:11434 |
--debug |
Enable debug mode for more verbose logging | false |
--run-name |
Name of the run for identification purposes | {model}_{timestamp} |
--resume |
Resume from the last incomplete run | false |
--batch-size |
Number of entries to process in each batch | 1024 |
--num-ctx |
Context length for Ollama | 4096 |
Extract-Specific Options¶
| Option | Description | Default |
|---|---|---|
--prompt |
Path to the prompt file in YAML format | bsllmner2/prompt/prompt_extract.yml |
--format |
Path to the JSON schema file for the output format | None |
Usage Examples¶
bsllmner2_extract \
--bs-entries tests/data/example_biosample.json \
--prompt bsllmner2/prompt/prompt_extract.yml \
--format bsllmner2/format/cell_line.schema.json \
--model llama3.1:70b \
--debug
With Docker:
docker compose exec app bsllmner2_extract \
--bs-entries tests/data/example_biosample.json \
--model llama3.1:70b
Prompt Specification¶
Prompts are defined as a YAML list where each element has role and content.
- role: system
content: |-
You are a smart curator of biological data
- role: user
content: |-
I will input JSON formatted metadata of a sample...
Here is the input metadata:
At runtime, the BioSample entry JSON is appended to the content of the last message.
Select-Mode NER Prompt (build_extract_prompt_for_select)¶
When extraction runs as the first stage of Select mode (driven by the select config, not a standalone prompt YAML), the prompt is synthesized in code by bsllmner2.pipeline.build_extract_prompt_for_select(). Its user message includes two rule blocks:
- Output rules — JSON-only output, per-field value-type handling, prefer exact mentions, avoid hallucination
- Category assignment rules — domain-agnostic boundaries that mitigate cross-field leaks observed in large-scale runs:
- Each extracted value belongs to at most one category; ambiguous values must pick the single most appropriate one by biological meaning
- Values are classified by biological meaning, not by the attribute key/label in the input (e.g., if an attribute labeled
drugactually containsHeLa, it belongs incell_line) - Experimental control terms (
negative control,NC,vehicle,mock,empty vector,scramble,non-targeting,shControl,siControl, …) are not extracted into any category — they are experimental conditions, not biological entities
These rules are intentionally generic (no ontology- or field-specific guidance) so bsllmner stays applicable to arbitrary select configs.
Customization¶
- Copy the built-in prompt (
bsllmner2/prompt/prompt_extract.yml) - Edit the category descriptions and output rules
- Specify the file with
--prompt
Output Format Specification¶
When a JSON Schema is specified with --format, the Ollama structured output feature (format parameter) constrains the output format.
Built-in schema (bsllmner2/format/cell_line.schema.json):
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"cell_line": { "type": ["string", "null"] }
},
"required": ["cell_line"],
"additionalProperties": true
}
If --format is omitted, the LLM responds in free form. The last JSON object/array is extracted from the response using a regex and parsed.
Resume¶
When --resume is specified, processing continues from the previous interruption. The resume file is automatically deleted after successful completion.
The same --run-name must be specified as the original run. If the original run used the auto-generated name ({model}_{timestamp}), you need to find it from the resume file in bsllmner2-results/extract/.
Result Files¶
See Data Formats for the full result schema.
| File | Description |
|---|---|
bsllmner2-results/extract/{run_name}.json |
Complete result |
bsllmner2-results/extract/{run_name}_resume.json |
Resume intermediate file (during processing only) |
The default run_name is {model}_{YYYYMMDD_HHMMSS} (UTC).