Alphafold Msa Annotation Plan Report
Gazi S. Mahmud | ORCID: 0009-0006-4046-889X | Lawrence Berkeley National Laboratory

Research Question

Does AlphaFold MSA depth predict functional annotation richness in the bacterial pangenome, and is structural novelty (low MSA depth) systematically enriched in accessory and singleton gene families relative to core genes?

Key Findings

H1: Core genes are 2.9× more structurally represented than accessory genes

Core gene clusters have a median MSA depth of 15,308 — 2.89× higher than the auxiliary+singleton median of 5,299 and 2.77× higher than the auxiliary non-singleton median of 5,527. The separation is most...
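The fold differences quoted above follow directly from the three medians. A minimal sketch of that arithmetic, where the variable and function names are illustrative and not taken from the report:

```python
# Medians quoted in the summary above (MSA depth per gene-cluster class).
core_median = 15_308
aux_plus_singleton_median = 5_299
aux_non_singleton_median = 5_527


def fold_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rounded to two decimals."""
    return round(numerator / denominator, 2)


print(fold_ratio(core_median, aux_plus_singleton_median))  # 2.89
print(fold_ratio(core_median, aux_non_singleton_median))   # 2.77
```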

Lignin Community Enrichment Plan Report
Markus de Raad (LBNL) | ORCID: 0000-0001-8263-9198

Research Question

How does sequential enrichment on lignin, with or without labile carbon supplementation, shape bacterial and fungal community composition — and does enrichment history create ecological memory that influences subsequent community assembly?

Key Findings

Finding 1: Lignin enrichment massively restructures the bacterial community

The base community is dominated by unclassified taxa (Incertae Sedis, 44.3%) with low abundances of Pseudomonas (<0.1%). After a single round of lignin enrichment, Pseudomonas (39.3%) and Acinetobacter (25.2%)...

ENIGMA Carbon Census 1

Completed Reviewed
Enigma Carbon Census 1 Plan Report
Adam Arkin (University of California

Research Question

For 83 groundwater- and necromass-derived carbon compounds proposed for community
enrichment and isolate phenotyping, what is known about (1) their environmental
distribution, (2) the catabolic pathways that could utilize them, (3) the organisms
encoding those pathways, and (4) how those pathways co-occur within and across
organisms — and which ENIGMA isolates or environmentally observed organisms are
likely utilizers of each compound?

Key Findings

1. The census is mostly dark: 74 of 83 compounds have no isolate-level utilizer call

The governing result is a gap, not a coverage claim. Of 83 compounds (59 SSO-groundwater + 24 necromass), all 83 resolved to structures (InChIKey via PubChem), 54 linked to KEGG, but only 9 are...

Caulobacter Fur–Lipid A Loss

In Progress Needs Re-review
Caulobacter Fur Lipida Loss Plan Report
Adam Arkin (University of California

Research Question

Why does inactivation of fur (the ferric uptake regulator) permit the loss of lipid A in Caulobacter crescentus, when no equivalent connection to Fur or iron homeostasis is reported in the other three Gram-negative species known to tolerate lipid A loss (N. meningitidis, A. baumannii, M. catarrhalis)?

Key Findings

Finding 1 — Δfur ΔsspB is a dual-release switch (Δfur arm statistically supported; ΔsspB arm hypothesis-only)

The 4584-vs-4580 contrast (Δfur ΔsspB ΔrsaA vs ΔrsaA) correlates with Leaden 2018's Δfur signal at Spearman ρ = 0.315, p = 2.08e-03 over 93 Leaden Δfur DEGs,...

Berdl Data Atlas Plan Report
Adam Arkin (University of California

Research Question

What data is available in BERDL (across tenants, agencies, and programs), what
biological topics does it cover, and where do datasets amplify each other when
combined? Build a depiction that serves both KBase users picking what to
analyze and funders / PIs evaluating cross-agency, cross-program synergies.

Key Findings

Finding 1: BERDL has billion-row depth across 10+ biological dimensions

65 curated 'headline' tables hit on the live cluster show BERDL is simultaneously deep across genomes, genes, proteins, structures, phenotype/fitness, samples, community profiles, mass spec, viruses, biochemistry,...

Annotation Gap Discovery Plan Report
Janaka N. Edirisinghe (ORCID: [0000-0003-2493-234X](https://orcid.org/0000-0003-2493-234X))

Research Question

Can we systematically identify and resolve metabolic annotation gaps in bacterial genomes by integrating experimental growth phenotypes, gene fitness data, metabolic model gapfilling, pangenome context, and sequence homology?

Key Findings

1. Evidence Triangulation Resolves 47.8% of Annotation Gaps

Of 201 gapfilled enzymatic reaction-organism pairs across 14 Fitness Browser organisms and 18 carbon sources, 96 (47.8%) were assigned candidate genes with confidence scoring. This exceeds the pre-specified H1 threshold of 30%,...

Harvard Forest Warming Plan Report
Chris Mungall

Research Question

After ~25 years of +5°C experimental soil warming at the Harvard Forest Barre Woods plot, does the functional transcript pool (metatranscriptome KO/Pfam composition) diverge from the genome pool (metagenome KO/Pfam composition) more strongly than expected from neutral community turnover, and which functional categories drive any divergence?

Key Findings

1. Community composition reproduces the published Actinobacteria-up / Acidobacteria-down signal

Per-phylum Welch t-test on kraken2 read-based relative abundance (n=14 control + 14 heated direct samples, BH-FDR across phyla × horizon):

| Phylum | Horizon | Control mean | Heated mean |...

Metal Resistance Global Biogeography Report

Research Question

Where on Earth are metal-resistant environmental bacteria most abundant, and where are
the gaps in public metagenomic sampling that prevent us from knowing?

Key Findings

  • 260,652 environmental MAGs extracted from MGnify; 30,497 after filtering
    host-associated biomes.
  • 22,356 MAGs with usable geospatial coordinates (73.3% coordinate coverage of
    environmental MAG set).
  • ENA batch API retrieval: 24,511 sample records; 16,964 (69.2%) have valid...

Soil Frontier Genomics Report

Research Question

Key Findings

Clay Shield Hypothesis — Null Result:
- 5,441 soil samples with clay content, mine proximity, nighttime lights, uranium,
  and functional gene counts (n_genes_by_counts).
- All predictive models have negative out-of-sample R²:
  - Soil & Climate: R² = −0.205 ± 0.197
  - Geochemical:...

T4ss Cazy Environmental Hgt Report

Research Question

Do Type IV secretion systems (T4SS) in environmental bacteria physically co-localise with
carbohydrate-active enzyme (CAZy) genes in a manner consistent with horizontal gene transfer,
and does phylogenetic analysis of syntenic GT2 glycosyltransferases reveal cross-phylum
transfer events that cannot be explained by vertical inheritance?

Key Findings

  • 21.8% of 30,497 high-quality environmental MAGs carry T4SS/conjugative machinery
    (6,652 MAGs; multi-marker definition: VirB4/6/8/9/10/11, VirD4, TraI/D, TrwB, TraG).
  • 92 CAZy families show elevated co-occurrence with T4SS loci at ≤10 kb (threshold
    validation pending); GT2...

Soil Metal Functional Genomics Report

Research Question

Which microbial functional gene categories (COGs) are significantly associated with soil
metal concentrations at global scale, and do these associations reflect metabolic adaptation,
resistance, or passive environmental filtering?

Key Findings

  • 2,355 significant COG-metal associations (FDR < 0.05) across 9 metals (Cu, Co,
    Cr, Ni, Zn, Pb, As, Cd, Hg) in 51,748 soil samples. Chromium and Lead drive the
    strongest signals; transporters (ABC, RND) and biosynthesis genes dominate top hits.
  • Copper-specific analysis (116 COGs,...

Microbeatlas Metal Ecology Plan Report

Research Question

Do metal-resistance functions in the global microbiome reflect phylogenetic constraint (conserved
in lineages) or environmental selection (enriched in metal-contaminated habitats)? We test this
using Pagel's λ to quantify phylogenetic signal, characterise whether metal-resistant lineages are
ecological generalists or specialists in a 464K-sample atlas, and use nitrification as a
metabolic-specialist positive control.

Key Findings

Finding 1: Bacterial niche breadth is moderately phylogenetically conserved; metal type diversity predicts it beyond phylogeny

Across 1,264 bacterial genera with ≥ 3 OTUs in MicrobeAtlas, Levins' B_std shows strong
phylogenetic signal (Pagel's λ = 0.787, p = 7.9×10⁻¹⁰², LRT). Habitat...
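For readers unfamiliar with the niche-breadth metric, the textbook definition of Levins' B and its standardised form can be sketched as follows. The habitat counts are hypothetical, and the report's own pipeline (for example, how habitats were binned) is not shown in this summary, so treat this as an illustration of the formula only:

```python
def levins_breadth(abundances):
    """Levins' B = 1 / sum(p_i^2) over habitat proportions p_i."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    proportions = [a / total for a in abundances]
    return 1.0 / sum(p * p for p in proportions)


def levins_breadth_std(abundances):
    """Standardised breadth B_std = (B - 1) / (n - 1), ranging from 0 to 1."""
    n = len(abundances)
    if n < 2:
        raise ValueError("need at least two habitat categories")
    return (levins_breadth(abundances) - 1.0) / (n - 1)


# Hypothetical occurrence counts across four habitat categories.
specialist = [100, 0, 0, 0]    # found in a single habitat
generalist = [25, 25, 25, 25]  # spread evenly across habitats

print(levins_breadth_std(specialist))  # 0.0
print(levins_breadth_std(generalist))  # 1.0
```

A value near 0 marks a habitat specialist and a value near 1 a generalist.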

Lanthanide Methylotrophy Atlas

Completed Reviewed
Lanthanide Methylotrophy Atlas Plan Report

Research Question

Across BERDL's 293K-genome pangenome, what is the phylogenomic distribution and cassette-completeness of the lanthanide-dependent methanol/ethanol oxidation system (xoxF / xoxJ / PQQ / lanmodulin), and does its presence correspond to environments containing rare earth elements?

Key Findings

1. xoxF (REE-dependent MDH) outnumbers mxaF (Ca-dependent MDH) by ~19:1 across the BERDL pangenome — H1 strongly supported

Across 293,059 GTDB-r214 genomes, eggNOG KEGG_ko = K00114 (xoxF, lanthanide-dependent methanol dehydrogenase, EC 1.1.2.8) is annotated in 3,690 genomes, while...

Prophage Amr Comobilization Plan Report
Justin Reese, Claude (AI co-scientist

Research Question

At pangenome scale (293K genomes, 27K species), are antibiotic resistance genes preferentially located within or adjacent to prophage regions, and does this co-localization predict AMR gene mobility and accessory-genome status?

Key Findings

Finding 1: AMR genes frequently share contigs with prophage markers

Over half (55.7%) of AMR gene instances in the top-100 AMR-burdened species reside on contigs that also carry strict prophage markers (terminase, phage structural proteins, holin/lysin). Among these co-localized AMR genes,...

Clay Confined Subsurface Plan Report

Research Question

Do BERDL's cultured bacterial genomes from clay-confined deep-subsurface environments recapitulate the genomic signatures the recent literature has identified — biosynthetic self-sufficiency (Beaver & Neufeld 2024; Becraft 2021), the H₂-driven anaerobic chemolithoautotrophy toolkit (Wood–Ljungdahl + group 1 [NiFe]-hydrogenase + dissimilatory sulfate reduction, per Bagnoud 2016), and a cultivation-driven porewater-vs-rock-attached signature dichotomy (Bagnoud 2016 vs Mitzscherling 2023) — relative to surface soil microbes?

Key Findings

Finding 1 — Cultured clay-confined genomes carry the Bagnoud Mont Terri porewater signature, not the Mitzscherling rock-attached signature (H3, supported)

The 9 BERDL genomes traceable to clay-confined deep-subsurface biosamples (8 from Mont Terri Opalinus boreholes, 1 from a bentonite...

Bacillota B Subsurface Accessory Plan Report

Research Question

Within the Bacillota_B phylum (Desulfosporosinus, BRH-c8a Peptococcaceae, BRH-c4a Desulfotomaculales, and other obligate-anaerobe deep-subsurface Firmicutes), what accessory gene content distinguishes deep-clay-isolated genomes from phylum-matched soil-baseline genomes? Beyond the curated marker dictionary used in clay_confined_subsurface (which mostly tested Wood–Ljungdahl, [NiFe]-hydrogenase, dsr-apr-sat — and whose IR-side markers turned out to be wrong genes), what does the BERDL pangenome gene-cluster–level signal say about subsurface adaptation in this phylum?

Key Findings

Finding 1 — 547 eggNOG OGs are significantly enriched in deep-clay Bacillota_B vs soil-baseline Bacillota_B; the enriched set falls into the pre-registered functional categories (anaerobic respiration, sporulation revival, mineral attachment, regulators, osmoadaptation), with anaerobic...

Gene Function Ecological Agora

Completed Reviewed
Gene Function Ecological Agora Plan Report
Adam Arkin, ORCID: 0000-0002-4999-2931, Affiliation: U.C. Berkeley / Lawrence Berkeley National Laboratory

Research Question

Across the prokaryotic tree (GTDB r214; 293,059 genomes / 27,690 species), build a multi-resolution innovation + acquisition-depth atlas of bacterial function classes, anchored to clade phylogeny and environmental ecology. Per (clade × function-class) tuple, the atlas reports: production rate (paralog expansion above null), acquisition profile by depth (recent vs ancient gain events), MGE context, environmental ecology, and phenotype anchoring where data exist. Test whether regulatory and metabolic function classes show distinct acquisition-depth profiles, anchored to specific weak-prior hypotheses (Bacteroidota PUL, Mycobacteriota mycolic-acid, Cyanobacteria PSII, Alm 2006 TCS reproduction).

Plan v2.9 reframed the central deliverable from a single "regulatory-vs-metabolic asymmetry" headline to the broader innovation/acquisition/ecology atlas with regulatory-vs-metabolic as one diagnostic among several. See DESIGN_NOTES.md v2.8 entry.

Key Findings

Finding 1 — Producer null is responsive to known paralog signal

The natural_expansion class (200 UniRef50 clusters with documented within-species paralog count ≥ 3 and cross-species presence in ≥ 5 pilot species) shows positive producer z-scores at all five ranks, with effect size growing...

Metal Cross Resistance Plan Report
Paramvir S. Dehal

Research Question

Is the genetic architecture of metal cross-resistance conserved across phylogenetically diverse bacteria, or is it rewired species by species?

Key Findings

1. Metal cross-resistance is universal and directionally conserved (H1 strongly supported)

Across 317 organism-metal pair observations (28 organisms, 85 unique metal pairs), 98.1% of gene-level fitness correlations are positive (311/317) and 99.1% are statistically significant (p...

Ecotype Functional Differentiation Plan Report
Justin Reese

Research Question

Do gene-content ecotypes within bacterial species differ in their COG functional profiles, showing differentiation in adaptive functions while sharing core metabolism?

Key Findings

Finding 1: Gene-content ecotypes are widespread across bacterial species

Ecotype clustering via PCA + KMeans on auxiliary gene presence/absence matrices identified valid gene-content ecotypes in 12 of 15 sampled species (80%). Species averaged 3.7 ecotypes each (range: 2–6), with a mean...

Ibd Phage Targeting Plan Report
Adam Arkin

Research Question

Which bacterial pathobionts are enriched, ubiquitous, and non-protective in IBD, UC, and Crohn's disease patients — considered both across indications and within distinct patient subgroups defined by demographics, severity, native-microbiome structure, and treatment history — and of those pathobionts, which are tractable phage-therapy targets given the available (or characterizable) phages, their host range, the evolutionary escape routes their target strains have available, and the ecological consequences of removing them?

Three coupled deliverables:

  1. Patient stratification: a reproducible ecotype framework trained on public cohorts that each UC Davis patient can be assigned to, with ecotype-specific pathobiont signatures.
  2. Pathobiont target atlas: a scored list of candidate targets per ecotype (and per UC-Davis patient), ranked against an explicit biological / phage-availability / ecological-durability rubric.
  3. Per-patient cocktail drafts: for each of the ~21 unique UC Davis patients, a proposed phage cocktail with candidate phages, strain-coverage evidence, Tier-B/C flags, and confidence notes.

Key Findings

1. Four reproducible IBD ecotypes with clear disease stratification

Training on 8,489 MetaPhlAn3 samples (fact_taxon_abundance, CMD_HEALTHY + CMD_IBD cohorts) with two independent methods — LDA on pseudo-counts and GMM on CLR + PCA-20 — across K ∈ {2..8}. Per-method fit measures (LDA...
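The CLR step named above is the centred log-ratio transform commonly applied to compositional abundance data before PCA. A minimal sketch, assuming a small pseudocount to handle zeros; the pseudocount value and the sample vector are hypothetical and not taken from the report:

```python
import math


def clr(abundances, pseudocount=1e-6):
    """Centred log-ratio: log of each part minus the mean log across parts."""
    logs = [math.log(a + pseudocount) for a in abundances]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical relative abundances for four taxa in one sample.
sample = [0.70, 0.20, 0.10, 0.0]
transformed = clr(sample)
print([round(v, 3) for v in transformed])
```

CLR values sum to zero within a sample by construction. This removes the unit-sum constraint that otherwise distorts Euclidean methods such as PCA.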

Plant Microbiome Ecotypes

Completed Needs Re-review
Plant Microbiome Ecotypes Plan Report
Adam P. Arkin — U.C. Berkeley / Lawrence Berkeley National Laboratory

Research Question

What is the genomic basis for plant-microbe associations across different plant compartments (rhizosphere, root, phyllosphere, endophyte)? Can we classify plant-associated microbial genera into beneficial, neutral, pathogenic, and dual-nature cohorts with mechanistic hypotheses, and identify which plant-interaction functions are associated with horizontal gene transfer vs. stable vertical inheritance?

Key Findings

1. Plant compartments impose a small but real functional shift on microbial communities (H1, weakly supported)

Canonical Phase 2b finding: compartment identity explains a small fraction of variance in functional profiles after location-vs-dispersion separation. db-RDA (constrained...

Genotype To Phenotype Enigma Plan Report
Adam Arkin

Research Question

Can we predict bacterial growth phenotype — at multiple resolutions from binary growth through continuous kinetics to complex dynamics — from genome content and growth condition, in a way where the predictive features are biologically interpretable, validated against independent fitness data, and actionable for rational experimental design at a contaminated field site?

Acinetobacter baylyi ADP1 Data Explorer

Completed Needs Re-review
Acinetobacter Adp1 Explorer Plan Report
Beril Admin, Paramvir S. Dehal

Research Question

What is the scope and structure of a comprehensive ADP1 database, and how do its annotations, metabolic models, and phenotype data intersect with BERDL collections (pangenome, biochemistry, fitness, PhageFoundry)?

Key Findings

1. Rich Multi-Omics Database with 6 Data Modalities

The user-provided SQLite database contains 15 tables with 461,522 total rows and 135 MB of data for Acinetobacter baylyi ADP1 and 13 related genomes. The central genome_features table has 5,852 genes with 51 annotation columns spanning...

Enigma Sso Asv Ecology Plan Report
Adam Arkin

Research Question

Does 16S community similarity across the 9 ENIGMA SSO wells (3×3 grid, ~4 m span at Oak Ridge) recapitulate the spatial arrangement in X, Y, and Z? Where it deviates, can hydrogeological connectivity or environmental gradients explain the pattern? Can we infer functional differences from taxonomy and what they imply about subsurface environmental parameters?

Key Findings

1. Community Similarity Tracks Spatial Arrangement at Meter Scale

Sediment-associated microbial communities across the 9 SSO wells (3×3 grid, ~6 m span) show significant distance-decay of similarity (Mantel test: Spearman ρ = 0.323, p = 0.029, 9,999 permutations). Mean Bray-Curtis...
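A Mantel test with a Spearman statistic, as cited above, can be sketched in plain Python. This is an illustration of the general technique and not the report's code. The grid coordinates and the toy dissimilarity matrix are hypothetical, and the one-sided p-value convention is an assumption:

```python
import random
from itertools import combinations


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def mantel(dist_a, dist_b, permutations=9999, seed=0):
    """One-sided Mantel test: permute rows and columns of dist_b together."""
    n = len(dist_a)
    pairs = list(combinations(range(n), 2))
    flat_a = [dist_a[i][j] for i, j in pairs]
    observed = spearman(flat_a, [dist_b[i][j] for i, j in pairs])
    rng = random.Random(seed)
    index = list(range(n))
    at_least = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        rng.shuffle(index)
        permuted = [dist_b[index[i]][index[j]] for i, j in pairs]
        if spearman(flat_a, permuted) >= observed:
            at_least += 1
    return observed, at_least / (permutations + 1)


# Hypothetical 3x3 grid of well coordinates (metres); not the real SSO layout.
coords = [(x, y) for x in (0, 2, 4) for y in (0, 2, 4)]
geo = [[((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for bx, by in coords]
       for ax, ay in coords]
# Toy dissimilarity that grows with distance, so the test should be significant.
bray = [[0.15 * d for d in row] for row in geo]

rho, p = mantel(geo, bray, permutations=999)
print(round(rho, 3), p)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations. Shuffling site labels keeps that dependence intact.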

Pseudomonas Carbon Ecology Plan Report
Mark Andrew Miller (ORCID: 0000-0001-9076-6066)

Research Question

Among free-living Pseudomonas clades, does the carbon source utilization profile predict the soil ecosystem type from which strains were isolated — and do clades that have transitioned to host-associated lifestyles show predictable losses of specific carbon pathways?

Key Findings

Finding 1: Host-Associated Pseudomonas Show Dramatic Loss of Plant-Derived Sugar Pathways

Pseudomonas sensu stricto (the P. aeruginosa group) shows near-complete loss of plant-derived sugar catabolism compared to Pseudomonas_E (the P. fluorescens/putida group). Of the 62 GapMind...

Cf Formulation Design Plan Report
Adam P. Arkin — U.C. Berkeley / Lawrence Berkeley National Laboratory

Research Question

Can we build a multi-criterion framework that explains measured P. aeruginosa PA14 inhibition from metabolic competition, growth kinetics, and patient ecology data, and use it to design optimal 1–5 organism commensal formulations for competitive exclusion in CF lungs?

Pgp Pangenome Ecology Plan Report
Priya Ranjan

Research Question

Does environmental selection shape the distribution of plant growth-promoting (PGP) bacterial genes across the BERDL pangenome (293K genomes, 27K species), and are those genes core or accessory within their carrier species?

Key Findings

H1 SUPPORTED — PGP traits form a non-random syndrome, but nitrogen fixation is ecologically distinct

Across 11,272 species with at least one PGP gene, 8 of 10 focal-gene pairs were significantly associated after BH-FDR correction. Five pairs showed positive co-occurrence and three showed...
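The BH-FDR correction mentioned above is the Benjamini-Hochberg step-up procedure. A minimal sketch of how adjusted p-values are obtained; the ten raw p-values are invented for illustration and are not the report's results:

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for ten pairwise association tests.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
q = benjamini_hochberg(raw)
significant = [qi < 0.05 for qi in q]
print(sum(significant), "of", len(raw), "hypothetical tests pass FDR < 0.05")
```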

Amr Environmental Resistome Plan Report
Paramvir S. Dehal

Research Question

Do antimicrobial resistance gene profiles differ between ecological niches across 27,000 bacterial species? Using 83K AMR gene clusters mapped to 293K genomes with environmental metadata, we test whether the resistome is structured by ecology — and whether intrinsic (core) and acquired (accessory) resistance show different environmental signatures.

Key Findings

1. Clinical species carry 2.5× more AMR gene clusters than environmental species (H1 supported)

Species from clinical sources have a median of 5 AMR gene clusters, compared to 2 for soil, aquatic, and host-associated species (Kruskal-Wallis H = 781.9, p = 9.4×10⁻¹⁶⁷, η² = 0.056)....

AMR Co-Fitness Support Networks

Completed Reviewed
Amr Cofitness Networks Plan Report
Paramvir S. Dehal

Research Question

What genes are co-regulated with antimicrobial resistance (AMR) genes across growth conditions, and do these "support networks" explain the uniform fitness cost of resistance? Using cofitness data and ICA fitness modules from 25 bacteria, we identify the functional context in which AMR genes operate.

Key Findings

1. AMR genes are embedded in larger-than-average co-regulated modules

Only 24% of AMR genes (192/801) are assigned to ICA fitness modules, but the modules they inhabit are significantly larger than non-AMR modules: median 46 vs 27 genes (MWU p = 1.7×10⁻⁸). This indicates that when AMR...

Counter Ion Effects Plan Report
Paramvir S. Dehal, Aindrila Mukhopadhyay

Research Question

When bacteria are exposed to metal salts (CoCl₂, NiCl₂, CuCl₂), how much of the observed fitness effect is caused by the metal cation versus the counter anion (chloride)? Does correcting for chloride confounding change the conclusions of the Pan-Bacterial Metal Fitness Atlas?

Key Findings

1. 39.8% of Metal-Important Genes Are Also NaCl-Important

Across 19 organisms and 14 metals (86 organism × metal pairs), 4,304 of 10,821 metal-important gene records (39.8%) are also important under NaCl stress. This substantial overlap exists for every metal tested — from 9.2% for...

Fitness Modules x Pangenome Conservation

Completed Needs Re-review
Module Conservation Plan Report
Paramvir S. Dehal

Research Question

Are ICA fitness modules enriched in core pangenome genes, and do cross-organism module families map to the core genome?

Key Findings

Module Genes Are More Core Than Average

  • Module genes: 86.0% core vs all genes: 81.5% (+4.5 percentage points)
  • Genes assigned to ICA modules are co-regulated functional units, and they skew toward the conserved core genome

(Notebook: 01_module_conservation.ipynb)

Most...

Field Vs Lab Fitness Plan Report
Paramvir S. Dehal

Research Question

Which genes matter for survival under environmentally-realistic conditions but appear dispensable in the lab, and vice versa? Do field-relevant fitness effects predict pangenome conservation better than lab-only effects?

Key Findings

ENIGMA CORAL Contains No DvH Fitness Data (NB01)

The ENIGMA CORAL database (47 tables, enigma_coral on BERDL) was surveyed for complementary data. Key finding: DvH is completely absent from the database. The single TnSeq library is for FW300-N2E2 (Pseudomonas), DubSeq libraries...

Metabolic Capability Dependency Plan Report
Christopher Neely, Sierra Moxon

Research Question

Just because a bacterium's genome encodes a complete metabolic pathway (metabolic capability), does the organism actually depend on it? Can we distinguish genomic capability from functional dependency using experimental fitness data?

Key Findings

H1 Supported: A Substantial Fraction of Complete Pathways Are Functionally Neutral

Across 1,695 pathway-organism pairs from 48 organisms, 15.8% of genomically complete pathways were classified as latent capabilities — pathways the genome encodes but that show no detectable fitness...

Within-Species AMR Strain Variation

Completed Needs Re-review
Amr Strain Variation Report
Paramvir S. Dehal

Research Question

Within a species, how does the AMR repertoire vary between strains, and what drives that variation?

Key Findings

Finding 1: The majority of AMR genes are variable or rare within species

Across 1,305 species and 180,025 genomes, 51.3% of AMR gene-species occurrences are rare (present in <=5% of strains), 41.3% are variable (5-95%), and only 7.5% are fixed (>=95%). The median variabilit
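The three prevalence classes above can be expressed as a small classifier. This sketch uses the thresholds quoted in the text; the function name and the choice to resolve the 5% and 95% boundaries as rare and fixed respectively are assumptions:

```python
def prevalence_class(n_present, n_strains):
    """Classify a gene by the fraction of a species' strains that carry it."""
    fraction = n_present / n_strains
    if fraction <= 0.05:
        return "rare"
    if fraction >= 0.95:
        return "fixed"
    return "variable"


# Hypothetical counts for a species with 200 sequenced strains.
for carriers in (4, 90, 198):
    print(carriers, prevalence_class(carriers, 200))
```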

Pan-Bacterial Metal Fitness Atlas

Completed Needs Re-review
Metal Fitness Atlas Plan Report
Paramvir S. Dehal, Adam Deutschbauer

Research Question

Across diverse bacteria subjected to genome-wide fitness profiling under metal stress, what is the genetic architecture of metal tolerance — is it encoded in the core or accessory genome, is it conserved across species, and can fitness-validated metal tolerance genes predict capabilities across the broader pangenome?

Key Findings

1. Metal-Important Genes Are Enriched in the Core Genome

Across 22 organisms and 14 metals, genes with significant fitness defects under metal stress are 87.4% core vs 76.9% baseline (OR=2.08, p=4.3e-162). This is the opposite of the initial hypothesis (H1a), which predicted accessory...
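The odds ratio quoted above can be approximately recovered from the two rounded percentages. The report presumably computed it from raw gene counts, which are not shown here, so this sketch is only a consistency check on the rounded figures:

```python
def odds(p):
    """Convert a proportion in (0, 1) to odds."""
    return p / (1.0 - p)


def odds_ratio(p_group, p_baseline):
    """Ratio of the odds in the group of interest to the baseline odds."""
    return odds(p_group) / odds(p_baseline)


# 87.4% of metal-important genes are core vs 76.9% at baseline.
print(round(odds_ratio(0.874, 0.769), 2))  # 2.08
```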

The Pan-Bacterial Essential Metabolome

Completed Needs Re-review
Essential Metabolome Plan Report
Paramvir S. Dehal

Research Question

Which biochemical reactions are universally essential across bacteria, and what does the essential metabolome reveal about the minimal core metabolism required for microbial life?

Key Findings

High Conservation of Amino Acid Biosynthesis Pathways

17 of 18 amino acid biosynthesis pathways are present in all 7 organisms analyzed (100% within this sample):
- Complete pathways: arg, asn, chorismate, cys, gln, gly, his, ile, leu, lys, met, phe, pro, thr, trp, tyr, val

One...

Adp1 Deletion Phenotypes Plan Report
Paramvir S. Dehal

Research Question

What is the condition-dependent structure of gene essentiality in Acinetobacter baylyi ADP1, as revealed by the de Berardinis single-gene deletion collection grown on 8 carbon sources?

Key Findings

1. Carbon sources define a three-tier essentiality landscape

The 8 carbon sources partition into demanding, moderate, and robust tiers based on the fraction of genes showing growth defects. Urea is the most demanding (97.9% of genes show severe defects at ratio < 0.5), while quinate is the...

ADP1 Triple Essentiality Concordance

In Progress Needs Re-review
Adp1 Triple Essentiality Plan Report
Paramvir S. Dehal

Research Question

Among genes that TnSeq says are dispensable in Acinetobacter baylyi ADP1, does FBA correctly predict which ones have growth defects? Can direct mutant growth rate measurements serve as an independent axis to evaluate where computational (FBA) and genetic (TnSeq) methods agree or disagree?

Pan-Bacterial AMR Gene Landscape

Completed Needs Re-review
Amr Pangenome Atlas Plan Report
Paramvir S. Dehal

Research Question

What is the distribution, conservation, phylogenetic structure, functional context, and environmental association of antimicrobial resistance (AMR) genes across 27,000 bacterial species pangenomes?

Key Findings

1. AMR Genes Are Massively Depleted from the Core Genome

AMR genes are significantly less conserved than the pangenome average: only 30.3% are core vs 46.8% baseline (OR=0.49, chi-squared=23,117, p≈0). The auxiliary genome is 2.2× enriched for AMR (33.6% vs 15.3%). This depletion...

Enigma Contamination Functional Potential Plan Report
Paramvir S. Dehal

Research Question

Do high-contamination Oak Ridge groundwater communities show enrichment for taxa with higher inferred stress-related functional potential compared with low-contamination communities?

Key Findings

Multiplicity and sample-size context (primary panel)

Model-family sample counts from data/model_family_sample_counts.tsv frame how much data each analysis used:

| Mode | Base Spearman n | Adj+Cov n | Adj+Fraction n | High-coverage subset n |
...

Core Gene Tradeoffs Plan Report
Paramvir S. Dehal

Research Question

Why are core genome genes MORE likely to show positive fitness effects when deleted, and what functions and conditions drive this burden paradox?

Key Findings

The Burden Paradox Is Function-Specific

Not all functional categories show the paradox. Core genes are disproportionately burdensome in Protein Metabolism (+6.2pp), Motility (+7.8pp), and RNA Metabolism (+12.9pp). But Cell Wall reverses: non-core cell wall genes are MORE burdensome...

Ecotype Analysis Plan Report
Paramvir S. Dehal

Research Question

What drives gene content similarity between bacterial genomes: environmental similarity or phylogenetic relatedness?

Key Findings

Analysis of 172 species with sufficient environmental and phylogenetic data reveals:

Phylogeny Usually Dominates

  • Median partial correlation for environment: 0.0025
  • Median partial correlation for phylogeny: 0.0143
  • Phylogeny dominates in 60.5% of species
  • Environment...
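A partial correlation, as reported in the list above, measures the association between two variables after removing the linear effect of a third. A minimal first-order sketch; the three input correlations are hypothetical, and the report may have used a matrix-based variant such as a partial Mantel test:

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical pairwise correlations: x = gene-content similarity,
# y = environmental similarity, z = phylogenetic relatedness.
r_gene_env = 0.30
r_gene_phylo = 0.50
r_env_phylo = 0.40

print(round(partial_correlation(r_gene_env, r_gene_phylo, r_env_phylo), 3))
```

When environment and phylogeny are themselves correlated, the partial value for environment drops below the raw correlation. That is why the two effects need to be separated.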

Web of Microbes Data Explorer

Completed Needs Re-review
Webofmicrobes Explorer Plan Report
Paramvir S. Dehal

Research Question

What does the kescience_webofmicrobes exometabolomics collection contain, which organisms overlap with the Fitness Browser, and how well do metabolite uptake/release profiles connect to pangenome-predicted metabolic capabilities?

Key Findings

1. WoM Action Encoding Uses Four Distinct Semantics, Not Three

The WoM database encodes metabolite observations with a 4-action system that differs between control and organism entries:

| Actor | Action | Meaning | Count |
Control ("The Environment")...

The Pan-Bacterial Essential Genome

Completed Needs Re-review
Essential Genome Plan Report
Paramvir S. Dehal

Research Question

Which essential genes are conserved across bacteria, which are context-dependent, and can we predict function for uncharacterized essential genes using module context from non-essential orthologs?

Key Findings

15 Gene Families Are Essential in All 48 Bacteria

The absolute core of bacterial life: ribosomal proteins (rpsC, rplW, rplK, rplB, rplA, rplF, rps11, rpsJ, rpsI, rpsM), chaperonin (groEL), CTP synthase (pyrG), translation elongation factor G (fusA), valyl-tRNA synthetase (valS), and...

Ecotype Env Reanalysis Plan Report
Paramvir S. Dehal

Research Question

Does the environment effect on gene content become stronger when analysis is restricted to genuinely environmental samples, excluding human-associated genomes whose AlphaEarth embeddings reflect hospital satellite imagery rather than ecological habitat?

Key Findings

1. Clinical bias does NOT explain the weak environment signal (H0 not rejected)

Environmental species (n=37, median partial correlation 0.051) do NOT show stronger environment–gene content correlations than human-associated species (n=93, median 0.084). The Mann-Whitney U test is far from...

Respiratory Chain Wiring Plan Report
Paramvir S. Dehal

Research Question

How is Acinetobacter baylyi ADP1's branched respiratory chain wired across carbon sources — which NADH dehydrogenases and terminal oxidases are required for which substrates?

Key Findings

1. Each carbon source uses a distinct respiratory chain configuration

ADP1's branched respiratory chain (62 genes across 8 subsystems) is wired in a condition-dependent manner. Quinate requires only Complex I; acetate requires Complex I, cytochrome bo3, ACIAD3522, and more; glucose requires...

Conservation Vs Fitness Plan Report
Paramvir S. Dehal

Research Question

Are essential genes preferentially conserved in the core genome, and what functional categories distinguish essential-core from essential-auxiliary genes?

Key Findings

  • 44 of 48 FB organisms mapped to pangenome species clades
  • 177,863 gene-to-cluster links at 100.0% median protein identity, 94.2% median gene coverage
  • 34 organisms have >=90% coverage; 33 used for downstream analysis (Dyella79 excluded due to locus tag...

Temporal Core Genome Dynamics

In Progress Needs Re-review
Temporal Core Dynamics Plan
Paramvir S. Dehal

Research Question

How does core genome composition change over sampling time, and do genes transition in and out of core status?

Bacdive Phenotype Metal Tolerance Plan Report
Paramvir S. Dehal

Research Question

Can BacDive-measured bacterial phenotypes (Gram stain, oxygen tolerance, metabolite utilization, enzyme activities) predict metal tolerance as measured by Fitness Browser experiments and the Metal Fitness Atlas?

Key Findings

1. Gram-Negative Bacteria Have Significantly Higher Metal Tolerance Scores (d=-0.61)

Gram-negative species have higher metal tolerance scores than Gram-positive species (Cohen's d = -0.61, p < 1e-60, n = 3,272 species). This is the largest effect among all phenotype features tested....
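For readers unfamiliar with the effect size quoted above, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below is a minimal version using only the standard library; the function name and the toy scores are assumptions, and the summary does not state which group was subtracted from which, so the sign here depends purely on argument order.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d = (mean_a - mean_b) / pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy tolerance scores (hypothetical, not from the report).
scores_a = [1, 2, 3]
scores_b = [2, 3, 4]

d_ab = cohens_d(scores_a, scores_b)  # negative: group A has the lower mean
d_ba = cohens_d(scores_b, scores_a)  # same magnitude, opposite sign
```

Swapping the arguments flips the sign without changing the magnitude, which is why a negative d can coexist with a statement that one group scores higher.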

Cog Analysis Plan Report
Paramvir S. Dehal

Research Question

How do COG functional category distributions differ across core, auxiliary, and novel genes in bacterial pangenomes?

Key Findings

Universal Functional Partitioning in Bacterial Pangenomes

Analysis of 32 species across 9 phyla (357,623 genes) reveals a remarkably consistent "two-speed genome":

Novel/singleton genes consistently enriched in:
- L (Mobile elements): +10.88% enrichment, 100% consistency across species...

The 5,526 Costly + Dispensable Genes

Completed Needs Re-review
Costly Dispensable Genes Plan Report
Paramvir S. Dehal

Research Question

What characterizes genes that are simultaneously burdensome (fitness improves when deleted) and not conserved in the pangenome? Are they mobile elements, recent acquisitions, degraded pathways, or something else?

Key Findings

Costly+Dispensable Genes Are Mobile Genetic Elements

The 5,526 costly+dispensable genes are overwhelmingly associated with mobile genetic elements. They are 7.45x more likely to contain mobile element keywords in their descriptions (transposase, integrase, phage, IS element,...

Conservation Fitness Synthesis Plan Report
Paramvir S. Dehal

Research Question

How does a gene's importance for bacterial survival relate to its evolutionary conservation, and what does the conserved genome actually look like?

Key Findings

The Gradient

There is a clear, quantitative gradient from essential genes (82% core) to always-neutral genes (66% core). More important genes are more conserved -- but the effect is modest. Even genes with no detectable fitness effect in any experiment are 66% core. The gradient spans...

Env Embedding Explorer Plan Report
Paramvir S. Dehal

Research Question

What do AlphaEarth environmental embeddings capture, and how do they relate to geographic coordinates and NCBI environment labels?

Key Findings

1. Environmental samples show 3.4x stronger geographic signal than human-associated samples

AlphaEarth embeddings encode geographic/environmental signal, but the strength depends on the sample source. For environmental samples (Soil, Marine, Freshwater, Extreme, Plant), nearby genomes...

Aromatic Catabolism Network Plan Report
Paramvir S. Dehal

Research Question

Why does aromatic catabolism in Acinetobacter baylyi ADP1 require Complex I (NADH dehydrogenase), iron acquisition, and PQQ biosynthesis when growth on other carbon sources does not?

Key Findings

1. Aromatic catabolism requires a 51-gene support network spanning 4 metabolic subsystems

The 51 quinate-specific genes in ADP1 organize into a coherent metabolic dependency network around the β-ketoadipate pathway. Co-fitness analysis assigns 44/51 genes (86%) to four functional...

Metabolic Capability vs Dependency

Completed Needs Re-review
Pathway Capability Dependency Plan Report
Dileep Kishore, Paramvir S. Dehal

Research Question

When a bacterium's genome encodes a complete biosynthetic or catabolic pathway, does the organism actually depend on it? Can we use fitness data to distinguish active dependencies from latent capabilities — and predict which pathways are candidates for evolutionary gene loss?

Key Findings

1. Pathway Completeness Alone Is Insufficient to Predict Metabolic Dependency

Of 161 classified organism-pathway pairs (7 Fitness Browser organisms, 23 GapMind pathways), only 35.4% (57/161) are Active Dependencies where a complete pathway contains fitness-important genes. The largest...

Fitness Effects Conservation Plan Report
Paramvir S. Dehal

Research Question

Is there a continuous gradient from essential genes (core) to dispensable genes (accessory) across the full fitness spectrum, and what does the fitness landscape of novel genes look like?

Key Findings

Conservation Increases with Fitness Importance

A clear gradient from essential to neutral genes:

Fitness category | n genes | % Core
Essential (no viable mutants) | 27,693 | 82%
Often sick (>10% experiments) | 15,989 | 78%
Mixed...

Cofitness Coinheritance Plan Report
Paramvir S. Dehal

Research Question

Do genes with correlated fitness profiles (co-fit) tend to co-occur in the same genomes across a species' pangenome? Does functional coupling constrain which genes are gained and lost together?

Key Findings

Pairwise Co-fitness Weakly Predicts Co-occurrence

Across 9 organisms with co-fitness data (2.25M cofit pairs vs 22.5M prevalence-matched random pairs), co-fit gene pairs show a weak but consistent positive co-occurrence signal. The mean delta phi (cofit - random) is +0.011 across organisms,...
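The phi coefficient mentioned above measures how strongly two binary presence/absence profiles co-occur, and "delta phi" is the difference between a co-fit pair's phi and that of a comparison pair. The sketch below assumes each gene is represented as a 0/1 vector over genomes; the function name and the three toy profiles are hypothetical and do not reproduce the project's prevalence-matching procedure.

```python
from math import sqrt


def phi_coefficient(a, b):
    """Phi coefficient between two equal-length 0/1 presence vectors."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A gene present in all or no genomes has no variance; report 0.0 then.
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


# Toy presence/absence profiles across six genomes (hypothetical).
gene_a = [1, 1, 1, 0, 0, 0]
cofit_partner = [1, 1, 0, 0, 0, 1]
random_partner = [1, 0, 0, 1, 0, 1]

phi_cofit = phi_coefficient(gene_a, cofit_partner)
phi_random = phi_coefficient(gene_a, random_partner)
delta_phi = phi_cofit - phi_random
```

A positive `delta_phi` means the co-fit pair co-occurs more than the comparison pair; the toy difference here is far larger than the +0.011 mean reported in the summary.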

Fw300 Metabolic Consistency Plan Report
Paramvir S. Dehal

Research Question

For Pseudomonas fluorescens FW300-N2E3 (ENIGMA groundwater isolate), how consistent are exometabolomic outputs (Web of Microbes), genome-wide gene fitness (Fitness Browser), species-level utilization phenotypes (BacDive), and computational pathway predictions (GapMind)?

Key Findings

1. High overall concordance across databases (94% mean concordance)

Of the 58 metabolites produced or increased by FW300-N2E3 (Web of Microbes), 21 could be cross-referenced against at least one other database. Among these testable metabolites, 17/21 (81%) were fully concordant across all...

Fitness Modules Plan Report
Paramvir S. Dehal

Research Question

Can we decompose RB-TnSeq fitness compendia into latent functional modules via robust ICA, align them across organisms using orthology, and use module context to predict gene function?

Key Findings

  1. The strict membership threshold (|weight| >= 0.3, max 50 genes) was critical. The initial D'Agostino K-squared approach gave 100-280 genes per module with weak cofitness signal (59% enriched, 1-17x correlation). After switching to absolute weight thresholds, modules became biologically...

Bacdive Metal Validation Plan Report
Paramvir S. Dehal

Research Question

Do bacteria isolated from metal-contaminated environments have higher predicted metal tolerance scores than bacteria from uncontaminated environments?

Key Findings

1. Bacteria From Metal-Contaminated Environments Have Significantly Higher Metal Tolerance Scores

Organisms isolated from heavy metal contamination sites have metal tolerance scores a full standard deviation above the environmental baseline (Cohen's d = +1.00, Mann-Whitney p=0.006, n=10)....

Pangenome Openness Plan Report
Paramvir S. Dehal

Research Question

Do open pangenomes show different patterns of environmental vs phylogenetic effects compared to closed pangenomes?

Key Findings

No Correlation Found

Analysis of pangenome openness vs environment/phylogeny effects revealed no significant relationship:

Metric | Spearman rho | p-value
Openness vs Environment effect | -0.05 | 0.54
Openness vs Phylogeny effect | 0.03 | ...

Metal Specificity Plan Report
Paramvir S. Dehal

Research Question

Among the 12,838 metal-important genes identified by the Metal Fitness Atlas, which are specifically required for metal tolerance vs general stress survival — and do the metal-specific genes show the expected accessory-genome enrichment?

Key Findings

1. 55% of Metal-Important Genes Are Metal-Specific

Of the 7,609 metal-important gene records with fitness matrix data across 24 organisms, 4,177 (54.9%) are metal-specific — they show significant fitness defects under metal stress but a <5% sick rate across 5,945 non-metal experiments....
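The rule described above can be read as a two-part filter: a gene must show a significant defect under metal stress, and it must be "sick" in fewer than 5% of non-metal experiments. The sketch below encodes that reading; the function name, the boolean inputs, and the per-experiment sick flags are assumptions about how the data might be represented, not the project's actual pipeline.

```python
def is_metal_specific(metal_defect_significant, nonmetal_sick_flags,
                      max_sick_rate=0.05):
    """Apply the two-part rule: metal defect present, non-metal sick rate < 5%.

    nonmetal_sick_flags holds one boolean per non-metal experiment
    (True = the gene's mutants were sick in that experiment).
    """
    if not metal_defect_significant or not nonmetal_sick_flags:
        return False
    sick_rate = sum(nonmetal_sick_flags) / len(nonmetal_sick_flags)
    return sick_rate < max_sick_rate


# Hypothetical genes, each scored across 100 non-metal experiments.
rarely_sick = [True] * 4 + [False] * 96    # 4% sick rate
borderline = [True] * 5 + [False] * 95     # exactly 5%, fails the strict "<"

specific = is_metal_specific(True, rarely_sick)
not_specific = is_metal_specific(True, borderline)

# The reported share follows directly from the counts in the summary.
reported_share = round(4177 / 7609 * 100, 1)
```

Whether the 5% boundary is strict is an assumption taken from the "<5%" wording in the summary.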

Lab Field Ecology Plan Report
Paramvir S. Dehal

Research Question

Do lab-measured fitness effects under contaminant stress predict the field abundance of Fitness Browser organisms across Oak Ridge groundwater sites with varying geochemistry?

Key Findings

14 of 26 Fitness Browser Genera Detected at Oak Ridge

Of 26 unique genera represented in the Fitness Browser, 14 are detected in Oak Ridge groundwater communities via 16S amplicon sequencing. The most prevalent are Sphingomonas (93% of 108 sites), Pseudomonas (91%), and Caulobacter...

PaperBLAST Data Explorer

Completed Needs Re-review
Paperblast Explorer Plan Report
Paramvir S. Dehal

Research Question

What does the kescience_paperblast collection contain, how current is it, and what are its coverage patterns across organisms, domains of life, and functional databases?

Key Findings

Finding 1: One organism dominates nearly half of all literature

Homo sapiens alone accounts for 46.7% of all gene-paper records in PaperBLAST. The top 5 organisms (H. sapiens, M. musculus, R. norvegicus, A. thaliana, D. melanogaster) capture 72.8%. Of 20,723 organisms...

Amr Fitness Cost Plan Report
Paramvir S. Dehal

Research Question

Do antimicrobial resistance (AMR) genes impose a fitness cost in the absence of antibiotic selection pressure? Using genome-wide RB-TnSeq fitness data from 28 bacteria, we test whether transposon knockouts of AMR genes show systematically positive fitness (mutant grows better than wildtype) under standard growth conditions, indicating the intact AMR gene is a metabolic burden.

Key Findings

1. Universal cost of resistance across 25 bacterial species (H1 supported)

AMR gene knockouts show systematically higher fitness than non-AMR gene knockouts under non-antibiotic conditions, confirming that resistance genes impose a metabolic burden. A DerSimonian-Laird random-effects...

Truly Dark Genes Plan Report
Adam Arkin

Research Question

Among the ~6,400 Fitness Browser genes that remain functionally unannotated even after bakta v1.12.0 reannotation, what distinguishes them from "annotation-lag" dark matter, and can their fitness phenotypes, genomic context, and sparse annotations prioritize them for experimental characterization?

Key Findings

Finding 1: Only 16.3% of "dark matter" resists modern annotation

Of 39,532 Fitness Browser dark genes with pangenome links, bakta v1.12.0 reannotation reclassifies 33,105 (83.7%) — leaving just 6,427 "truly dark" genes where both the original pipeline and bakta agree: these are hypothetical...

Functional Dark Matter Plan Report
Adam Arkin

Research Question

Which genes of unknown function across 48 bacteria have strong fitness phenotypes, and can biogeographic patterns, pathway gap analysis, and cross-organism fitness concordance — combined with existing function predictions and conservation data — prioritize them for experimental follow-up?

Key Findings

Finding 1: One in four bacterial genes is functionally dark, and 17,344 have experimentally measurable phenotypes

Across 48 Fitness Browser organisms (228,709 genes), 57,011 (24.9%) lack functional annotation ("hypothetical protein," DUF, or "uncharacterized"). Of these, 7,787 show strong...

Snipe Defense System Plan Report
Chris Mungall

Research Question

How prevalent are SNIPE (Surface-associated Nuclease Inhibiting Phage Entry) homologues across the 293K-genome BERDL pangenome, and does their taxonomic distribution, environmental context, or pangenome status (core vs. accessory) reveal ecological patterns of phage defense?

Key Findings

1. SNIPE resolves the phage resistance vs. metabolic cost trade-off

Saxton et al. (2026) showed that SNIPE constitutively localizes to the inner membrane and cleaves phage DNA as it passes through the ManYZ mannose transporter pore. Published knockout and coevolution studies (not from...

Nmdc Community Metabolic Ecology Plan Report
Christopher Neely

Research Question

Do the GapMind-predicted pathway completeness profiles of community resident taxa predict or correlate with observed metabolomics profiles in NMDC environmental samples across diverse habitat types?

Key Findings

Finding 1 — Black Queen dynamics are detectable at community scale

Across 13 testable amino acid biosynthesis pathways, 11 of 13 (85%) showed negative Spearman correlations between community pathway completeness and ambient amino acid metabolite intensity, the direction predicted by...
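A Spearman correlation of the kind described above is the Pearson correlation of the ranks of the two variables, so a negative value means samples with higher pathway completeness tend to have lower metabolite intensity. The sketch below is a minimal pure-Python version with average ranks for ties; the function names and the four toy samples are hypothetical, and inputs with no variation would need separate handling.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy per-sample values for one pathway (hypothetical).
completeness = [0.2, 0.5, 0.7, 0.9]
intensity = [8.0, 6.5, 7.0, 3.0]

rho = spearman_rho(completeness, intensity)
```

Repeating this per pathway and counting the negative coefficients would give a tally of the form "11 of 13" quoted in the summary.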

Prophage Ecology Plan Report
Adam Arkin

Research Question

How are prophage gene modules and terminase-defined prophage lineages distributed across bacterial phylogeny and environmental gradients, and which modules/lineages show environmental enrichment exceeding phylogenetic expectation?

Key Findings

1. Prophage gene modules are universal but structurally variable across 27,702 bacterial species

All 27,702 species in the BERDL pangenome carry prophage-associated gene clusters, with 4,005,537 total prophage gene clusters identified via eggNOG annotations. Three modules are...

Phb Granule Ecology Plan Report
Adam Arkin

Research Question

How are polyhydroxybutyrate (PHB) granule-forming pathways distributed across bacterial clades and environments, and does this distribution support the hypothesis that carbon storage granules are most beneficial in temporally variable feast/famine environments?

Key Findings

Finding 1: PHB pathways are widespread but phylogenetically concentrated

Across 27,690 GTDB species, 21.9% carry phaC (PHA synthase, the committed step for PHB biosynthesis) and 21.7% have a complete PHB pathway (phaC + phaA/phaB). The near-identical prevalence of phaC-only a

Openness Functional Composition Plan
Justin Reese

Research Question

Do species with open pangenomes show different COG functional enrichment patterns than species with closed pangenomes?

Resistance Hotspots Plan
William J. Riehl

Research Question

Which microbial species and ecological environments show the highest concentration of antibiotic resistance genes, and can we predict resistance accumulation from phylogenetic and ecological features?