Adam Arkin ORCID

Research Question

Can we predict bacterial growth phenotype — at multiple resolutions from binary growth through continuous kinetics to complex dynamics — from genome content and growth condition, in a way where the predictive features are biologically interpretable, validated against independent fitness data, and actionable for rational experimental design at a contaminated field site?

Overview

The project sits at a unique convergence of five datasets for the same Oak Ridge field isolates:

Dataset Scale What it provides
ENIGMA growth curves 303 plates, 27,632 curves, 123 strains, 195 molecules Continuous growth phenotype (lag, µmax, max OD, AUC, diauxy)
ENIGMA Genome Depot 3,110 genomes, 3.7M KO, 6.4M COG, 29.4M OG annotations Pre-computed genome features for all 123 growth strains
Fitness Browser 7 matching strains, 27M fitness scores Independent gene-level validation of predictor features
Web of Microbes 6 matching strains, 105 metabolites each Exometabolomic ground truth
Carbon source phenotypes 795 genomes × 379 conditions = ~53K binary labels Broad pretraining corpus (Dileep et al., preprint)

The project is structured in three acts:

Act I — Know the Collection (NB01-NB04):
- NB01 [done]: Growth curve fitting (27,632 curves, modified Gompertz, QC flags)
- NB02: Condition canonicalization and cross-dataset alignment (ChEBI-based)
- NB03: Functional diversity census (phylogeny, metabolic guilds, resistance/motility/mobile elements, pangenome outliers)
- NB04: Environmental context and biogeography (pangenome species-level via ncbi_env, microbeatlas global 16S, CORAL local Oak Ridge, SparCC co-occurrence)

Act II — Predict and Explain (NB05-NB08):
- NB05 [done]: Feature engineering (4,305 prevalence-filtered KOs, 4 feature levels, no PCA)
- NB06: GBDT variance partitioning + SHAP + FB concordance (nested M0→M3, the central analytical notebook)
- NB07: CSP transfer learning (pretrain on 795-genome corpus, fine-tune on ENIGMA)
- NB08: WoM exometabolomic prediction (pilot, 6 strains)

Act III — Diagnose and Propose (NB09-NB10):
- NB09 [done]: Conflict detection — 65.1% accuracy, 1,276 high-confidence errors, per-genus × condition-class error hotspots
- NB10 [done]: Active learning proposal — 50 experiments ranked by error × uncertainty × field relevance (fumaric acid, melibionic acid, nitrate; Prescottella, Microbacterium)

Data Collections

Review

AI Review BERIL Automated Review (Claude, claude-sonnet-4-20250514) 2026-04-24 Reviewed

Summary

This is an exceptionally comprehensive and methodologically sophisticated project that addresses a fundamental challenge in microbiology: predicting bacterial growth phenotype from genomic content and environmental conditions. The project successfully integrates five major datasets (ENIGMA growth curves, ENIGMA Genome Depot, Fitness Browser, Web of Microbes, Carbon Source Phenotypes) spanning 27,632 growth curves, 123 strains, and 46,389 training pairs to deliver both theoretical insights and practical applications for contaminated subsurface research. The work demonstrates that binary growth on amino acids and nucleosides is genuinely predictable from KO content (AUC ~0.78), while establishing biological limits for continuous parameter prediction and identifying mechanistically coherent features. The project exemplifies reproducible computational biology with outstanding documentation, 50 figures, extensive cached data artifacts, and a clear three-act analytical structure culminating in actionable experimental recommendations for Oak Ridge field research.

Methodology

The research approach is exemplary in both scope and rigor. The central research question—whether bacterial growth phenotype is predictable from genome content and growth condition in a biologically interpretable and experimentally actionable way—is clearly stated and systematically addressed through six testable hypotheses (H1-H6). The unique convergence of five complementary datasets for the same Oak Ridge field isolates represents an unprecedented resource for genotype-phenotype modeling, with 486 anchor pairs where growth curves, gene fitness, and genome annotations coexist.

The experimental design employs appropriate statistical methods matched to sample sizes: gradient-boosted decision trees (GBDT) for the full 46K-pair corpus, univariate correlation for small-n exometabolomic analysis, and genus-blocked holdout for rigorous cross-validation across 106 genera. Feature engineering is principled, using prevalence-based filtering (4,305 informative KOs) while preserving interpretability through named KEGG orthologs rather than dimensionality reduction. The three-act structure (Know the Collection, Predict and Explain, Diagnose and Propose) provides logical progression from data characterization through model development to actionable recommendations.

Data source identification is comprehensive and clearly documented across seven BERDL collections. The project appropriately handles known technical challenges, including the strain name collision pitfall (documented in docs/pitfalls.md) and string-typed numeric columns in Fitness Browser data. The modeling methodology section provides sufficient detail for replication, including specific hyperparameters, validation strategy, and feature interpretation approaches.

Code Quality

The analytical implementation demonstrates strong software practices adapted to biological data complexity. SQL queries for BERDL data extraction are correctly structured with appropriate type casting (CAST(fit AS DOUBLE) for Fitness Browser), handling of heterogeneous schemas across growth curve bricks, and efficient aggregation strategies to avoid memory issues with large tables.

Statistical methods are appropriately chosen for each analytical context: GBDT for high-dimensional multivariate prediction with genus-blocked validation, univariate per-metabolite correlation for small-sample exometabolomics, and SHAP analysis with correlation grouping for interpretable feature importance. The transition analysis from n=7 (genome-scale features) to n=46K (condition-specific features) effectively demonstrates the data requirements for mechanistic prediction.

The project successfully navigates several technical challenges documented in docs/pitfalls.md, including the strain name collision issue that affected 12 of 32 linkages between ENIGMA strains and the BERDL pangenome. Quality control procedures are systematically applied throughout, from growth curve fitting (modified Gompertz with multiple QC flags) through final model validation (genus-blocked holdout preventing phylogenetic leakage).

Notebook organization follows a logical progression with clear section headers, appropriate figure generation with consistent styling, and careful handling of edge cases (heterogeneous brick schemas, missing pH/temperature data, etc.).

Findings Assessment

The project's conclusions are well-supported by comprehensive data analysis and appropriately acknowledge both successes and limitations. The central finding that binary growth on amino acids (AUC 0.775) and nucleosides (0.780) is predictable from KO content while continuous parameters (µmax, lag, yield) are not represents a genuine biological insight supported by 46K training pairs across 106 genera.

The mechanistic interpretability of features is demonstrated through SHAP analysis identifying condition-specific catabolic genes (ribose transporter predicting ribose growth, protocatechuate cycloisomerase predicting aromatic catabolism). However, the project honestly reports weak Fitness Browser concordance (1.19× enrichment), correctly interpreting this as reflecting different biological questions (gene presence across genera vs. gene essentiality within strains) rather than invalidating the approach.

The global pH-driven niche partition connecting local Oak Ridge co-occurrence to 464K worldwide samples represents an unexpected and significant ecological finding, demonstrating how local contamination patterns reflect global microbial biogeography. The exometabolomic analysis successfully recovers 940 mechanistic gene-metabolite associations despite multivariate model failure at n=6, illustrating the importance of method selection for sample size.

Limitations are appropriately acknowledged: condition alignment remains string-based rather than ChEBI-ID based; continuous parameter prediction fails under cross-genus validation; and co-occurrence analysis represents correlation rather than causation. The project appropriately distinguishes between data limitations and fundamental biological constraints.

Suggestions

  1. Complete ChEBI-ID canonicalization: The current 42 molecular matches via string normalization could expand to 60-80 via systematic ChEBI ID resolution using the sys_oterm_id field in growth curve bricks, potentially strengthening cross-dataset validation.

  2. Implement formal active learning validation: While the 50-experiment ranking framework is in place, retrospective subsampling (AL-ranked vs. random experiment selection) would provide quantitative validation of the proposed approach.

  3. Add codon usage bias features: The fundamental limitation in continuous parameter prediction could potentially be addressed by computing CUB metrics from nucleotide sequences, as suggested in the Xu et al. (2025) Phydon approach.

  4. Expand FB concordance analysis: Consider KEGG-module-level expansion of SHAP features to distinguish "wrong feature, right pathway" from "wrong pathway" and potentially recover additional mechanistic signal.

  5. Strengthen co-occurrence analysis: Apply SparCC (compositionally aware correlation) to the full 100WS ASV matrix rather than relying on Spearman correlation on genus-level data.

  6. Document resource requirements: While runtime estimates are provided, memory requirements and recommended computing resources for full reproduction would help future users.

  7. Consider validation dataset expansion: The relatively small FB anchor set (7 strains) could potentially be expanded through additional strain-organism matching to strengthen biological validation.

This review was generated by an AI system. It should be treated as advisory input, not a definitive assessment.

Visualizations

Nb00 Condition Overlap

Nb00 Condition Overlap

Nb00 Growth Corpus Stats

Nb00 Growth Corpus Stats

Nb00 Per Strain Curves

Nb00 Per Strain Curves

Nb00 Strain Tiers

Nb00 Strain Tiers

Nb01 Curve Fit Examples

Nb01 Curve Fit Examples

Nb01 Parameter Distributions

Nb01 Parameter Distributions

Nb01 Qc Summary

Nb01 Qc Summary

Nb01 Single Brick Qc

Nb01 Single Brick Qc

Nb02 Anchor Growth Heatmap

Nb02 Anchor Growth Heatmap

Nb02 Condition Overlap 4Way

Nb02 Condition Overlap 4Way

Nb02 Coverage Summary

Nb02 Coverage Summary

Nb03 Cog By Guild

Nb03 Cog By Guild

Nb03 Genome Vs Ko

Nb03 Genome Vs Ko

Nb03 Ko Pca Guilds

Nb03 Ko Pca Guilds

Nb04 Cluster Env Comparison

Nb04 Cluster Env Comparison

Nb04 Genus Env Heatmap

Nb04 Genus Env Heatmap

Nb04 Global Cooccurrence

Nb04 Global Cooccurrence

Nb04 Global Genus Occurrence

Nb04 Global Genus Occurrence

Nb04 Oak Ridge Strain Map

Nb04 Oak Ridge Strain Map

Nb04 Oakridge Cooccurrence

Nb04 Oakridge Cooccurrence

Nb04 Pangenome Env Profiles

Nb04 Pangenome Env Profiles

Nb04 Well Guild Distribution

Nb04 Well Guild Distribution

Nb05 Feature Summary

Nb05 Feature Summary

Nb05 Ko Prevalence Filter

Nb05 Ko Prevalence Filter

Nb05 Target Distributions

Nb05 Target Distributions

Nb06 Feature Correlation

Nb06 Feature Correlation

Nb06 Group Shap

Nb06 Group Shap

Nb06 Shap Top20

Nb06 Shap Top20

Nb06 Variance Partition

Nb06 Variance Partition

Nb07 Bulk Vs Continuous

Nb07 Bulk Vs Continuous

Nb07 Confusion Matrices

Nb07 Confusion Matrices

Nb07 Coverage Gap

Nb07 Coverage Gap

Nb07 Fb Concordance

Nb07 Fb Concordance

Nb07 Fb Concordance Detail

Nb07 Fb Concordance Detail

Nb07 Full Corpus Results

Nb07 Full Corpus Results

Nb07 Full Corpus Shap

Nb07 Full Corpus Shap

Nb07 Model Comparison

Nb07 Model Comparison

Nb07 Model Diagnostics

Nb07 Model Diagnostics

Nb07 Per Condition Auc

Nb07 Per Condition Auc

Nb07 Roc Curves

Nb07 Roc Curves

Nb07 Shap Beeswarm

Nb07 Shap Beeswarm

Nb07 Shap Comparison N7 Vs Full

Nb07 Shap Comparison N7 Vs Full

Nb08 Fb Cognate Results

Nb08 Fb Cognate Results

Nb08 Mechanistic Examples

Nb08 Mechanistic Examples

Nb08 Method Comparison

Nb08 Method Comparison

Nb08 Production Ko Heatmap

Nb08 Production Ko Heatmap

Nb08 Wom Prediction

Nb08 Wom Prediction

Nb09 Active Learning Candidates

Nb09 Active Learning Candidates

Nb09 Conflict Detection

Nb09 Conflict Detection

Nb10 Active Learning Proposal

Nb10 Active Learning Proposal

Notebooks

Data Files

Filename Size
active_learning_candidates.tsv 43.5 KB
anchor_set.tsv 68.6 KB
bulk_continuous_results.tsv 0.3 KB
cluster_env_comparison.tsv 0.1 KB
cog_class_descriptions.tsv 0.9 KB
cog_matrix.tsv 17.0 KB
condition_canonical.tsv 95.3 KB
coral_strain_locations.tsv 12.2 KB
coverage_matrix.tsv 1932.6 KB
csp_transfer_predictions.tsv 25.5 KB
fb_concordance.tsv 0.4 KB
fb_concordance_matched.tsv 0.6 KB
fb_ko_mapping.tsv 54.2 KB
feature_correlation_groups.tsv 16.0 KB
full_corpus_binary_results.tsv 11.5 KB
full_corpus_predictions.tsv 3280.8 KB
full_corpus_shap.tsv 337.8 KB
gapmind_predictions.tsv 15.2 KB
global_genus_cooccurrence_jaccard.tsv 6.3 KB
group_shap_importance.tsv 2.2 KB
interaction_feature_comparison.tsv 11.6 KB
ko_matrix.parquet 4021.2 KB
microbeatlas_genus_summary.tsv 0.4 KB
model_predictions.tsv 82.4 KB
oakridge_genus_cooccurrence.tsv 4.6 KB
pangenome_env_profiles.tsv 1.8 KB
per_condition_accuracy.tsv 27.7 KB
proposed_experiments.tsv 7.2 KB
shap_importance.tsv 298.1 KB
strain_scalars.tsv 22.7 KB
variance_partition.tsv 1.0 KB
wom_fb_ko_correlations.tsv 83.0 KB
wom_prediction_results.tsv 0.4 KB
wom_predictions.tsv 31.0 KB