# Data Formats

## BioSample JSON Input (`bs_entries`)

A list of BioSample entries. Both JSON array and JSONL (one JSON object per line) formats are supported. Each entry must have an `accession` field.
```json
[
  {
    "accession": "SAMN00000001",
    "title": "HeLa cell RNA-seq",
    "characteristics": {
      "cell_line": "HeLa",
      "organism": "Homo sapiens"
    }
  }
]
```

JSONL format:

```
{"accession": "SAMN00000001", "title": "HeLa cell RNA-seq", ...}
{"accession": "SAMN00000002", "title": "HEK293 cell ChIP-seq", ...}
```
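Since both layouts are accepted, a loader can attempt a whole-document JSON parse first and fall back to line-by-line parsing. A minimal sketch; the `load_bs_entries` helper is illustrative, not part of the tool's API:

```python
import json

def load_bs_entries(text: str) -> list[dict]:
    """Parse BioSample entries from a JSON array or JSONL string."""
    text = text.strip()
    try:
        data = json.loads(text)
        # A single top-level object is treated as a one-entry list.
        entries = data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to JSONL: one JSON object per non-empty line.
        entries = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Every entry must carry an accession field.
    for entry in entries:
        if "accession" not in entry:
            raise ValueError(f"entry missing 'accession': {entry}")
    return entries
```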
## Mapping TSV (for evaluation)

A TSV file used for evaluating Select accuracy. A header row is required.

Note: The `extraction answer` column is the output of a previous tool (MetaSRA), not a human-curated ground truth, and it is not used for evaluation. Only `mapping answer ID` (human-curated) is used as the gold standard for Select mode evaluation.
| Column | Description |
|---|---|
| `BioSample ID` | BioSample accession |
| `Experiment type` | Experiment type (e.g. RNA-seq) |
| `extraction answer` | Previous tool output (not used for evaluation) |
| `mapping answer ID` | Human-curated ground truth mapping ID (used for Select evaluation) |
| `mapping answer label` | Ground truth mapping label |

Example:

```
BioSample ID	Experiment type	extraction answer	mapping answer ID	mapping answer label
SAMN00000001	RNA-seq	HeLa	CVCL_0030	HeLa
SAMN00000002	RNA-seq	HEK293	CVCL_0045	HEK293
```
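The evaluation convention above (gold standard = `mapping answer ID`, ratios stored as 0-1) can be sketched with the standard `csv` module. `select_accuracy` is an illustrative helper, not the tool's own evaluator:

```python
import csv
import io

def select_accuracy(tsv_text: str, predictions: dict[str, str]) -> float:
    """Fraction of BioSamples whose predicted term ID matches the
    human-curated 'mapping answer ID' column (returned as a 0-1 ratio)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    correct = sum(
        1 for row in rows
        if predictions.get(row["BioSample ID"]) == row["mapping answer ID"]
    )
    return correct / len(rows) if rows else 0.0

tsv = (
    "BioSample ID\tExperiment type\textraction answer\tmapping answer ID\tmapping answer label\n"
    "SAMN00000001\tRNA-seq\tHeLa\tCVCL_0030\tHeLa\n"
    "SAMN00000002\tRNA-seq\tHEK293\tCVCL_0045\tHEK293\n"
)
# One of two predictions matches the gold ID -> accuracy 0.5.
accuracy = select_accuracy(tsv, {"SAMN00000001": "CVCL_0030", "SAMN00000002": "CVCL_9999"})
```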
## Extract Result JSON (`ExtractResult`)

Saved to `bsllmner2-results/extract/{run_name}.json`.
```json
{
  "entries": [
    {
      "accession": "SAMN00000001",
      "extracted": { "cell_line": "HeLa" },
      "raw_output": "{\"cell_line\": \"HeLa\"}",
      "llm_timing": {
        "total_duration": 1000000000,
        "load_duration": 100000000,
        "eval_count": 50,
        "eval_duration": 500000000,
        "prompt_eval_count": 100
      }
    }
  ],
  "run_metadata": {
    "run_name": "llama3.1:70b_20250101_120000",
    "model": "llama3.1:70b",
    "thinking": false,
    "start_time": "2025-01-01T12:00:00Z",
    "end_time": "2025-01-01T12:10:00Z",
    "status": "completed",
    "processing_time_sec": 600.0,
    "total_entries": 1
  },
  "performance": null,
  "errors": []
}
```
### Key Fields

| Path | Type | Description |
|---|---|---|
| `entries[].accession` | string | BioSample accession |
| `entries[].extracted` | dict \| list \| null | Parsed extraction result |
| `entries[].raw_output` | string \| null | Raw JSON string from LLM |
| `entries[].llm_timing` | LlmTimingFields | Lightweight timing data (nanoseconds) |
| `run_metadata.run_name` | string | Run identifier |
| `run_metadata.model` | string | Model name |
| `run_metadata.start_time` | datetime | ISO 8601 UTC start time |
| `run_metadata.end_time` | datetime \| null | ISO 8601 UTC end time |
| `run_metadata.status` | "running" \| "completed" \| "failed" | Run status |
| `run_metadata.processing_time_sec` | float \| null | Processing time (seconds) |
| `run_metadata.total_entries` | int \| null | Total processed entries |
| `errors` | list[ErrorLog] | Error information |
### LlmTimingFields

Lightweight timing fields extracted from `ChatResponse` (nanoseconds). Replaces the full `ChatResponse` in persisted output.

| Field | Type | Description |
|---|---|---|
| `total_duration` | int | Total duration (ns) |
| `load_duration` | int | Model load duration (ns) |
| `eval_count` | int | Number of tokens generated |
| `eval_duration` | int | Token generation duration (ns) |
| `prompt_eval_count` | int | Number of prompt tokens |
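Since all durations are nanoseconds, deriving readable metrics is a matter of scaling. A minimal sketch with an illustrative `timing_summary` helper (not part of the tool):

```python
def timing_summary(t: dict) -> dict:
    """Derive human-readable metrics from nanosecond LlmTimingFields."""
    NS = 1e9
    return {
        "total_sec": t["total_duration"] / NS,
        # Latency excludes model load time (cold-start cost).
        "latency_sec": (t["total_duration"] - t["load_duration"]) / NS,
        # Generation speed: tokens produced per second of eval time.
        "tokens_per_sec": t["eval_count"] / (t["eval_duration"] / NS),
    }

summary = timing_summary({
    "total_duration": 1_000_000_000,
    "load_duration": 100_000_000,
    "eval_count": 50,
    "eval_duration": 500_000_000,
    "prompt_eval_count": 100,
})
```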
## Select Result JSON (`SelectResult`)

Saved to `bsllmner2-results/select/select_{run_name}.json`.
```json
{
  "entries": [
    {
      "extract": {
        "accession": "SAMN00000001",
        "extracted": { "cell_line": "HeLa", "tissue": "cervix" },
        "raw_output": "{\"cell_line\": \"HeLa\", \"tissue\": \"cervix\"}",
        "llm_timing": { "total_duration": 0, "load_duration": 0, "eval_count": 0, "eval_duration": 0, "prompt_eval_count": 0 }
      },
      "search_results": {
        "cell_line": {
          "HeLa": [
            {
              "term_uri": "http://purl.obolibrary.org/obo/CVCL_0030",
              "term_id": "CVCL:0030",
              "prop_uri": "http://www.w3.org/2000/01/rdf-schema#label",
              "value": "HeLa",
              "label": "HeLa",
              "exact_match": true,
              "text2term_score": null,
              "reasoning": null,
              "comments": ["Disease: Cervical adenocarcinoma"]
            }
          ]
        }
      },
      "text2term_results": {},
      "select_timings": {
        "cell_line": {
          "HeLa": { "total_duration": 500000000, "load_duration": 0, "eval_count": 20, "eval_duration": 200000000, "prompt_eval_count": 50 }
        }
      },
      "results": {
        "cell_line": [
          {
            "value": "HeLa",
            "term_id": "CVCL:0030",
            "term_uri": "http://purl.obolibrary.org/obo/CVCL_0030",
            "label": "HeLa",
            "exact_match": true,
            "reasoning": "Exact match found for HeLa"
          }
        ]
      }
    }
  ],
  "run_metadata": {
    "run_name": "llama3.1:70b_20250101_120000",
    "model": "llama3.1:70b",
    "thinking": false,
    "start_time": "2025-01-01T12:00:00Z",
    "end_time": "2025-01-01T12:15:00Z",
    "status": "completed",
    "processing_time_sec": 900.0,
    "total_entries": 1
  },
  "evaluation": null,
  "performance": null,
  "errors": []
}
```
### Key Fields

| Path | Type | Description |
|---|---|---|
| `entries[].extract` | ExtractEntry | Embedded extract result for this entry |
| `entries[].search_results` | dict[field, dict[value, list[SearchResult]]] | Stage 2a ontology search results |
| `entries[].text2term_results` | dict[field, dict[value, list[SearchResult]]] | Stage 2b text2term results |
| `entries[].select_timings` | dict[field, dict[value, LlmTimingFields]] | Per-field LLM timing |
| `entries[].results` | dict[field, list[ResolvedValue]] | Final mapping results |
| `evaluation` | EvaluationMetrics \| null | Evaluation metrics (independent from RunMetadata). All ratio fields (accuracy, precision, recall, f1) are stored as 0–1 ratios, not percentages. |
| `errors` | list[ErrorLog] | Error information |
### ResolvedValue

Unified result type for Select mode output.

| Field | Type | Description |
|---|---|---|
| `value` | string | Original extracted value |
| `term_id` | string \| null | Matched ontology term ID |
| `term_uri` | string \| null | Matched ontology term URI |
| `label` | string \| null | Ontology term label |
| `exact_match` | bool \| null | Whether it was an exact match |
| `reasoning` | string \| null | LLM reasoning for selection |
## Select Config JSON

Configuration file for Select mode. Defines the ontology file, prompt, and filter for each field.

```json
{
  "fields": {
    "cell_line": {
      "ontology_file": "/app/ontology/cellosaurus.owl",
      "prompt_description": "Cell line is a group of cells that are genetically identical...",
      "ontology_filter": { "hasDbXref": "NCBI_TaxID:9606" },
      "value_type": "string"
    },
    "drug": {
      "ontology_file": "/app/ontology/chebi.owl",
      "prompt_description": "Drug is a chemical or biological substance...",
      "value_type": "array"
    },
    "gene_perturbation": {
      "prompt_description": "Experimental perturbation applied to the target gene...",
      "value_type": "array"
    }
  }
}
```

For the full specification of each field, see Select Mode - Select Config Customization.
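A minimal sanity check over such a config can be sketched as follows. `validate_select_config` is illustrative and assumes only the constraints visible in the example above (`prompt_description` required, `value_type` either `"string"` or `"array"`, `ontology_file` optional):

```python
def validate_select_config(config: dict) -> list[str]:
    """Return a list of problems found in a Select config dict.
    Mirrors the structure shown above; not the tool's own validator."""
    problems = []
    for name, spec in config.get("fields", {}).items():
        if "prompt_description" not in spec:
            problems.append(f"{name}: missing prompt_description")
        if spec.get("value_type") not in ("string", "array"):
            problems.append(f"{name}: value_type must be 'string' or 'array'")
    return problems

config = {
    "fields": {
        "cell_line": {
            "ontology_file": "/app/ontology/cellosaurus.owl",
            "prompt_description": "Cell line ...",
            "value_type": "string",
        },
        "drug": {"prompt_description": "Drug ...", "value_type": "array"},
        "bad_field": {"value_type": "list"},
    }
}
problems = validate_select_config(config)
```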
## Prompt YAML

Prompts are defined in YAML as a list of `role` and `content` pairs.

```yaml
- role: system
  content: |-
    You are a smart curator of biological data
- role: user
  content: |-
    I will input JSON formatted metadata of a sample...
    Here is the input metadata:
```

`role` must be one of `"system"`, `"user"`, or `"assistant"`.
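After parsing the YAML (e.g. with `yaml.safe_load`), the role constraint can be enforced on the resulting list of dicts. A sketch with an illustrative `validate_prompt` helper (not part of the tool's API):

```python
VALID_ROLES = {"system", "user", "assistant"}

def validate_prompt(messages: list[dict]) -> None:
    """Check that a parsed prompt YAML is a list of role/content pairs
    whose roles are restricted to system/user/assistant."""
    for i, msg in enumerate(messages):
        if set(msg) != {"role", "content"}:
            raise ValueError(f"message {i}: expected exactly 'role' and 'content' keys")
        if msg["role"] not in VALID_ROLES:
            raise ValueError(f"message {i}: invalid role {msg['role']!r}")

prompt = [
    {"role": "system", "content": "You are a smart curator of biological data"},
    {"role": "user", "content": "I will input JSON formatted metadata of a sample..."},
]
validate_prompt(prompt)  # well-formed: no exception raised
```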
## Format JSON Schema

A JSON Schema that controls the LLM output format. It is passed to the Ollama `format` parameter.

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "cell_line": { "type": ["string", "null"] }
  },
  "required": ["cell_line"],
  "additionalProperties": true
}
```

In Select mode, the schema is dynamically generated from the SelectConfig field definitions (`build_extract_schema_for_select`). For `value_type: "array"`, the property is generated as `{"type": ["array", "null"], "items": {"type": "string"}}`. The generated schema always includes `"additionalProperties": false`.
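The generation rules just described can be sketched as follows. This is not the actual `build_extract_schema_for_select` implementation, only a minimal version consistent with the stated behavior (string fields become nullable strings, array fields become nullable string arrays, `additionalProperties` forced to `false`):

```python
def build_schema(fields: dict) -> dict:
    """Sketch of Select-mode schema generation from field definitions."""
    properties = {}
    for name, spec in fields.items():
        if spec.get("value_type") == "array":
            properties[name] = {"type": ["array", "null"], "items": {"type": "string"}}
        else:
            properties[name] = {"type": ["string", "null"]}
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,  # always emitted in Select mode
    }

schema = build_schema({
    "cell_line": {"value_type": "string"},
    "drug": {"value_type": "array"},
})
```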
## PerformanceSummary

Performance data is embedded in the `performance` field of ExtractResult and SelectResult. There is no separate benchmark file; all data lives inside the result JSON.

### Key Fields

| Path | Type | Description |
|---|---|---|
| `performance.total_input_entries` | int | Total input entries |
| `performance.completed_count` | int | Entries that completed processing |
| `performance.total_wall_sec` | float \| null | Total wall-clock time (seconds) |
| `performance.stage_timings[]` | StageTimings[] | Per-batch stage breakdown |
| `performance.ner_llm_timing` | LlmTimingSummary \| null | Aggregated NER LLM timing stats |
| `performance.select_llm_timing` | LlmTimingSummary \| null | Aggregated Select LLM timing stats (Select mode only) |
| `performance.disk_io` | DiskIoTimings | Disk I/O timing breakdown (Select mode only) |

Accuracy metrics (accuracy, precision, recall, f1) are in `SelectResult.evaluation`, not in PerformanceSummary.
### LlmTimingSummary Fields

| Field | Description |
|---|---|
| `call_count` | Number of LLM calls |
| `total_duration_sec` | Sum of `total_duration` across all calls |
| `mean_latency_sec` | Mean latency per call (`total_duration` - `load_duration`) |
| `p50/p95/p99_latency_sec` | Latency percentiles |
| `mean_tokens_per_sec` | Mean generation speed (`eval_count` / `eval_duration`) |
| `p50/p95_tokens_per_sec` | Tokens/sec percentiles |
| `mean_load_duration_sec` | Mean model load time (high values indicate cold starts) |
| `max_load_duration_sec` | Max model load time |
| `total_prompt_tokens` | Total prompt tokens processed |
| `total_eval_tokens` | Total tokens generated |

For interpretation guidance, see benchmarking.md.
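A sketch of how per-call `LlmTimingFields` could be aggregated into these statistics. The helper is illustrative, not the tool's code, and omits the p95/p99 percentiles and load-duration stats the real summary also reports:

```python
import statistics

def summarize_timings(timings: list[dict]) -> dict:
    """Aggregate per-call LlmTimingFields (nanoseconds) into
    LlmTimingSummary-style statistics."""
    NS = 1e9
    # Latency per call excludes model load time.
    latencies = [(t["total_duration"] - t["load_duration"]) / NS for t in timings]
    # Generation speed per call: tokens generated per second of eval time.
    tps = [t["eval_count"] / (t["eval_duration"] / NS) for t in timings]
    return {
        "call_count": len(timings),
        "total_duration_sec": sum(t["total_duration"] for t in timings) / NS,
        "mean_latency_sec": statistics.mean(latencies),
        "p50_latency_sec": statistics.median(latencies),
        "mean_tokens_per_sec": statistics.mean(tps),
        "total_prompt_tokens": sum(t["prompt_eval_count"] for t in timings),
        "total_eval_tokens": sum(t["eval_count"] for t in timings),
    }

calls = [
    {"total_duration": 1_000_000_000, "load_duration": 100_000_000,
     "eval_count": 50, "eval_duration": 500_000_000, "prompt_eval_count": 100},
    {"total_duration": 2_000_000_000, "load_duration": 0,
     "eval_count": 100, "eval_duration": 1_000_000_000, "prompt_eval_count": 200},
]
summary = summarize_timings(calls)
```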