Extract Mode¶
Performs Named Entity Recognition (NER) on BioSample records using an LLM to extract biological information in specified categories.
Overview¶
BioSample JSON
|
v
+------------+
| Load |
| bs_entries |
+------------+
|
v
+------------+
| Build |
| messages |
+------------+
|
v
+------------+
| Ollama |
| chat() |
+------------+
|
v
+------------+
| Parse JSON |
| response |
+------------+
- Load BioSample entries from JSON/JSONL
- Apply prompt (YAML) and format schema (JSON Schema)
- Send batch requests to Ollama
- Extract and parse JSON from responses
CLI Options¶
Common Options¶
| Option | Description | Default |
|---|---|---|
--bs-entries |
Path to the input JSON or JSONL file containing BioSample entries (required) | -- |
--model |
LLM model to use for NER | llama3.1:70b |
--thinking BOOL |
Enable or disable thinking mode for the LLM (true/false) |
false |
--max-entries |
Process only the first N entries (-1 for all) |
-1 |
--ollama-host |
Host URL for the Ollama server | http://localhost:11434 |
--debug |
Enable debug mode for more verbose logging | false |
--run-name |
Name of the run for identification purposes | {model}_{timestamp} |
--resume |
Resume from the last incomplete run | false |
--batch-size |
Number of entries to process in each batch | 1024 |
--num-ctx |
Context length for Ollama | 4096 |
Extract-Specific Options¶
| Option | Description | Default |
|---|---|---|
--prompt |
Path to the prompt file in YAML format | bsllmner2/prompt/prompt_extract.yml |
--format |
Path to the JSON schema file for the output format | None |
Usage Examples¶
bsllmner2_extract \
--bs-entries tests/data/example_biosample.json \
--prompt bsllmner2/prompt/prompt_extract.yml \
--format bsllmner2/format/cell_line.schema.json \
--model llama3.1:70b \
--debug
With Docker:
docker compose exec app bsllmner2_extract \
--bs-entries tests/data/example_biosample.json \
--model llama3.1:70b
Prompt Specification¶
Prompts are defined as a YAML list where each element has role and content.
- role: system
content: |-
You are a smart curator of biological data
- role: user
content: |-
I will input JSON formatted metadata of a sample...
Here is the input metadata:
At runtime, the BioSample entry JSON is appended to the content of the last message.
Customization¶
- Copy the built-in prompt (
bsllmner2/prompt/prompt_extract.yml) - Edit the category descriptions and output rules
- Specify the file with
--prompt
Output Format Specification¶
When a JSON Schema is specified with --format, the Ollama structured output feature (format parameter) constrains the output format.
Built-in schema (bsllmner2/format/cell_line.schema.json):
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"cell_line": { "type": ["string", "null"] }
},
"required": ["cell_line"],
"additionalProperties": true
}
If --format is omitted, the LLM responds in free form. The last JSON object/array is extracted from the response using a regex and parsed.
Resume¶
When --resume is specified, processing continues from the previous interruption. The resume file is automatically deleted after successful completion.
The same --run-name must be specified as the original run. If the original run used the auto-generated name ({model}_{timestamp}), you need to find it from the resume file in bsllmner2-results/extract/.
Result Files¶
See Data Formats for the full result schema.
| File | Description |
|---|---|
bsllmner2-results/extract/{run_name}.json |
Complete result |
bsllmner2-results/extract/{run_name}_resume.json |
Resume intermediate file (during processing only) |
The default run_name is {model}_{YYYYMMDD_HHMMSS} (UTC).