# ChIP-Atlas Data Processing
This guide explains how to process ChIP-Atlas data using bsllmner-mk2 for both human (hg38) and mouse (mm10) genome assemblies.
## Overview

### What is ChIP-Atlas?
ChIP-Atlas is a comprehensive data-mining suite for exploring epigenomic landscapes by fully integrating ChIP-seq, ATAC-seq, DNase-seq, and Bisulfite-seq experiments.
### Relationship with bsllmner-mk2
- Each SRX (experiment) entry in ChIP-Atlas is linked to a BioSample record
- SRX and BioSample have a one-to-one relationship
- ChIP-Atlas provides human-curated metadata values for each experiment
- bsllmner-mk2 can extract and map values from BioSample records using LLM
- This enables benchmarking LLM-based NER against human-curated annotations
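As an illustration of the one-to-one relationship, the `srx_to_biosample.json` mapping produced during data preparation can be pictured as a flat dictionary (the accession values below are made up):

```json
{
  "SRX0000001": "SAMN00000001",
  "SRX0000002": "SAMN00000002"
}
```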
## Prerequisites
Before processing ChIP-Atlas data, ensure:
- Docker environment is running (see getting-started.md)
- Ontology files are downloaded and converted (see Quick Start §2)
## Data Preparation

### scripts/prepare_bs_entries.py

This script downloads and prepares BioSample entries from ChIP-Atlas:

```bash
# Inside Docker container
docker compose exec app python3 scripts/prepare_bs_entries.py --genome-assembly <GENOME>
```
Options:

- `--genome-assembly`: Filter by genome assembly (e.g., `hg38`, `mm10`)
- `--force`: Re-download files even if they already exist
Output Files (in `chip-atlas-data/`):

| File | Description |
|---|---|
| `experimentList.tab` | Raw metadata from ChIP-Atlas |
| `experimentList.json` | Parsed experiment metadata |
| `SRA_Accessions.tab` | SRX to BioSample mapping source |
| `srx_to_biosample.json` | SRX to BioSample ID mapping |
| `bs_entries.jsonl` | BioSample entries (one per line) |
| `bs_entries/{prefix}/{accession}.json` | Cached individual BioSample files |
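After the script finishes, a quick sanity check on the prepared entries can be run from the host (paths taken from the table above):

```shell
# One BioSample entry per line, so the line count equals the entry count
wc -l chip-atlas-data/bs_entries.jsonl

# Pretty-print the first entry to confirm it is valid JSON
head -n 1 chip-atlas-data/bs_entries.jsonl | python3 -m json.tool
```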
## Select Configuration

### Configuration Structure

The `select-config.json` file defines which fields to extract and map to ontologies:
```json
{
  "fields": {
    "field_name": {
      "ontology_file": "/app/ontology/example.owl",
      "prompt_description": "Description for LLM",
      "ontology_filter": {
        "hasDbXref": "NCBI_TaxID:9606"
      },
      "value_type": "string"
    }
  }
}
```
For field property details, see Select Mode - Select Config Customization.
### Provided Configuration Files

| File | TaxID | Fields | Use Case |
|---|---|---|---|
| `select-config.json` | 9606 | 7 (cell_line, cell_type, tissue, disease, drug, gene, gene_perturbation) | Full extraction with gene info |
| `select-config-hg38.json` | 9606 | 5 (cell_line, cell_type, tissue, disease, drug) | Human ChIP-Atlas evaluation |
| `select-config-mm10.json` | 10090 | 5 (cell_line, cell_type, tissue, disease, drug) | Mouse ChIP-Atlas evaluation |
Note: You should customize the select-config based on your specific needs.
### Key Differences Between hg38 and mm10 Configs

| Aspect | hg38 | mm10 |
|---|---|---|
| cell_line TaxID filter | `NCBI_TaxID:9606` | `NCBI_TaxID:10090` |
| disease TaxID filter | `NCBITaxon:9606` | None (no filter) |
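In terms of the configuration structure shown earlier, the mm10 cell_line filter would look like the sketch below (the ontology file path and prompt text are illustrative, not taken from the shipped configs):

```json
{
  "fields": {
    "cell_line": {
      "ontology_file": "/app/ontology/example.owl",
      "prompt_description": "Description for LLM",
      "ontology_filter": {
        "hasDbXref": "NCBI_TaxID:10090"
      },
      "value_type": "string"
    }
  }
}
```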
## Processing hg38 (Human)

### 1. Prepare Data
This downloads and processes human experiments from ChIP-Atlas.
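The section text does not repeat the command; based on the preparation script described above, it is presumably:

```shell
docker compose exec app python3 scripts/prepare_bs_entries.py \
  --genome-assembly hg38
```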
### 2. Run Select Mode

```bash
docker compose exec app bsllmner2_select \
  --bs-entries ./chip-atlas-data/bs_entries.jsonl \
  --model llama3.1:70b \
  --select-config ./scripts/select-config-hg38.json \
  --run-name hg38-full \
  --debug
```
### 3. Check Results
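This page does not say where `bsllmner2_select` writes its output; assuming results land in a directory tagged with the `--run-name` (a hypothetical layout, verify against your installation), a quick look might be:

```shell
# Hypothetical output location -- adjust to wherever your runs are written
ls -lh results/hg38-full/

# Or follow the container logs while the run is active
docker compose logs -f app
```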
## Processing mm10 (Mouse)

### 1. Backup Existing Data (if switching from hg38)

```bash
# Optional: back up hg38 data before overwriting
mv chip-atlas-data/bs_entries.jsonl chip-atlas-data/bs_entries_hg38.jsonl
mv chip-atlas-data/experimentList.json chip-atlas-data/experimentList_hg38.json
mv chip-atlas-data/srx_to_biosample.json chip-atlas-data/srx_to_biosample_hg38.json
```
### 2. Prepare Data
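As with hg38, this step presumably runs the same preparation script with the mouse assembly:

```shell
docker compose exec app python3 scripts/prepare_bs_entries.py \
  --genome-assembly mm10
```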
### 3. Run Select Mode

```bash
docker compose exec app bsllmner2_select \
  --bs-entries ./chip-atlas-data/bs_entries.jsonl \
  --model llama3.1:70b \
  --select-config ./scripts/select-config-mm10.json \
  --run-name mm10-full \
  --debug
```
## Large-Scale Processing Tips

### Test with Limited Entries

Before processing the full dataset, test with a subset:

```bash
docker compose exec app bsllmner2_select \
  --bs-entries ./chip-atlas-data/bs_entries.jsonl \
  --select-config ./scripts/select-config-hg38.json \
  --model llama3.1:70b \
  --max-entries 100 \
  --run-name hg38-test
```
### Resume Interrupted Processing

If processing is interrupted, resume from where it left off:

```bash
docker compose exec app bsllmner2_select \
  --bs-entries ./chip-atlas-data/bs_entries.jsonl \
  --select-config ./scripts/select-config-hg38.json \
  --model llama3.1:70b \
  --run-name hg38-full \
  --resume
```
### Adjust Batch Size

If you encounter memory issues, reduce the batch size:

```bash
docker compose exec app bsllmner2_select \
  --bs-entries ./chip-atlas-data/bs_entries.jsonl \
  --select-config ./scripts/select-config-hg38.json \
  --model llama3.1:70b \
  --batch-size 256 \
  --run-name hg38-full
```

The default batch size is 1024.
### Sample Data for Quick Testing

Create a smaller dataset by sampling:

```bash
# Sample every 350th entry (reduces ~188k to ~500 entries)
awk 'NR % 350 == 1' chip-atlas-data/bs_entries.jsonl > chip-atlas-data/bs_entries.small.jsonl

# Run on sampled data
docker compose exec app bsllmner2_select \
  --bs-entries ./chip-atlas-data/bs_entries.small.jsonl \
  --select-config ./scripts/select-config-mm10.json \
  --model llama3.1:70b \
  --run-name mm10-test-small
```
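The stride arithmetic can be checked on synthetic data: `NR % 350 == 1` keeps lines 1, 351, 701, and so on, so roughly one entry in 350 survives (188,000 / 350 ≈ 537).

```shell
# 700 input lines -> lines 1 and 351 are kept, i.e. 2 samples
seq 1 700 | awk 'NR % 350 == 1' | wc -l
```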
## Troubleshooting

### Out of Memory Errors

**Symptom**: Container crashes during processing

**Solutions**:

- Reduce `--batch-size` (e.g., `--batch-size 128`)
- Reduce Ollama parallel requests in `compose.yml`
- Use a smaller model (e.g., `llama3.1:8b` instead of `llama3.1:70b`)
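The Ollama setting mentioned above is controlled by the `OLLAMA_NUM_PARALLEL` environment variable; in `compose.yml` it might look like the following sketch (the service name and surrounding settings depend on your file):

```yaml
services:
  ollama:
    environment:
      - OLLAMA_NUM_PARALLEL=1  # fewer concurrent requests, less memory pressure
```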
### Data Download Failures

**Symptom**: Network errors during `prepare_bs_entries.py`

**Solution**:

```bash
# Re-run with --force to retry failed downloads
docker compose exec app python3 scripts/prepare_bs_entries.py \
  --genome-assembly hg38 \
  --force
```
### Missing Ontology Files

**Symptom**: `FileNotFoundError` for `.owl` files

**Solution**:

- Run the download script: `python3 scripts/download_ontology_files.py`
- Convert Cellosaurus OBO to OWL (see Prerequisites section)
## Data Volume Reference

Approximate data sizes by genome assembly:

| Assembly | Experiments | BioSample Entries |
|---|---|---|
| hg38 | ~200,000+ | ~150,000+ |
| mm10 | ~188,000 | ~140,000 |
Processing times vary based on:

- Model size (larger models are slower but more accurate)
- GPU performance
- Number of parallel requests (`OLLAMA_NUM_PARALLEL`)
- Batch size