Pangenome Collection

kbase_ke_pangenome

Tenant: KBase · Snapshot 2026-04-29T21:35:53.838483+00:00

Primary

Schema status: discovered

Curation status: curated

Source: berdl-spark-connect://metrics.berdl.kbase.us:443

Philosophy

Enable comparative genomics at scale. Understand core vs accessory genome content, functional distributions, and evolutionary patterns across bacterial and archaeal species. Answer questions about what makes species unique and what they share.

Data sources: GTDB r214, eggNOG v6, GapMind, AlphaEarth

Citation & Attribution

Provider: KBase, DOE

Cite as: Arkin AP et al. (2018) KBase: The United States Department of Energy Systems Biology Knowledgebase. Nat Biotechnol 36:566-569

DOI: 10.1038/nbt.4163

Website: https://www.kbase.us/

Scale

293,059 genomes
27,690 species
1B+ genes
14 tables

Sample Queries

Get species with most genomes

-- Rank species by genome count, largest first, and show the top 20.
SELECT
  s.GTDB_species,
  p.no_genomes,
  p.no_core,
  p.no_aux_genome,
  s.mean_intra_species_ANI
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s
  ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
ORDER BY p.no_genomes DESC
LIMIT 20

Get functional annotations for a species

-- Count gene annotations per COG category and description for one species.
SELECT
  e.COG_category,
  e.Description,
  COUNT(*) AS gene_count
FROM kbase_ke_pangenome.gene g
JOIN kbase_ke_pangenome.eggnog_mapper_annotations e
  ON g.gene_id = e.gene_id
WHERE g.gtdb_species_clade_id = 's__Escherichia_coli--RS_GCF_000005845.2'
GROUP BY e.COG_category, e.Description
ORDER BY gene_count DESC
LIMIT 20

Get genomes for a species with quality metrics

-- LEFT JOIN keeps genomes that have no matching row in gtdb_metadata.
-- There is no ORDER BY, so the 100 rows returned are an arbitrary sample.
SELECT
  g.genome_id,
  g.ncbi_biosample_id,
  m.checkm_completeness,
  m.checkm_contamination,
  m.genome_size,
  m.gc_percentage
FROM kbase_ke_pangenome.genome g
LEFT JOIN kbase_ke_pangenome.gtdb_metadata m
  ON g.genome_id = m.accession
WHERE g.gtdb_species_clade_id = 's__Escherichia_coli--RS_GCF_000005845.2'
LIMIT 100
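
Sketch: reusing the species queries for another clade

The two species-specific queries above hardcode one species clade ID. The Python sketch below shows one way to turn that ID into a bound parameter, using a trimmed version of the genome quality query. It is a local illustration only and rests on several assumptions: SQLite stands in for the real Spark Connect source, the column types are guesses, every row is made up (the `toy_*` values are not BERDL data), and the helper names `build_toy_db` and `genomes_for_species` are hypothetical. An ORDER BY is added so the toy result is stable. Parameter syntax on the real backend may differ from SQLite's `?` placeholder.

```python
import sqlite3

# Trimmed form of the genome quality query, with the species clade ID as a
# bound parameter (?) and an ORDER BY added so the toy result is deterministic.
QUERY = """
SELECT
  g.genome_id,
  m.checkm_completeness,
  m.checkm_contamination
FROM kbase_ke_pangenome.genome g
LEFT JOIN kbase_ke_pangenome.gtdb_metadata m
  ON g.genome_id = m.accession
WHERE g.gtdb_species_clade_id = ?
ORDER BY g.genome_id
LIMIT 100
"""


def build_toy_db():
    """Create an in-memory stand-in with made-up rows (not real BERDL data)."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so the schema-qualified table
    # names from the sample query resolve unchanged.
    conn.execute("ATTACH DATABASE ':memory:' AS kbase_ke_pangenome")
    # Column types here are assumptions for the sketch.
    conn.execute(
        "CREATE TABLE kbase_ke_pangenome.genome "
        "(genome_id TEXT, gtdb_species_clade_id TEXT)"
    )
    conn.execute(
        "CREATE TABLE kbase_ke_pangenome.gtdb_metadata "
        "(accession TEXT, checkm_completeness REAL, checkm_contamination REAL)"
    )
    conn.executemany(
        "INSERT INTO kbase_ke_pangenome.genome VALUES (?, ?)",
        [
            ("toy_genome_A", "toy_clade_1"),
            ("toy_genome_B", "toy_clade_1"),
            ("toy_genome_C", "toy_clade_2"),
        ],
    )
    # Only toy_genome_A has a metadata row.
    conn.execute(
        "INSERT INTO kbase_ke_pangenome.gtdb_metadata VALUES (?, ?, ?)",
        ("toy_genome_A", 99.5, 0.4),
    )
    return conn


def genomes_for_species(conn, clade_id):
    """Run the query for one species clade ID, passed as a bound parameter."""
    return conn.execute(QUERY, (clade_id,)).fetchall()


if __name__ == "__main__":
    for row in genomes_for_species(build_toy_db(), "toy_clade_1"):
        print(row)
```

In the toy data, `toy_genome_B` has no metadata row, so the LEFT JOIN still returns it with empty quality columns. Binding the clade ID also avoids quoting problems when the ID comes from user input.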

Projects Using This Collection

AlphaFold MSA Depth as a Lens on the Bacterial Annotation Gap

Does AlphaFold MSA depth predict functional annotation richness in the bacterial pangenome, and is structural novelty (l...

Lignin Enrichment and Ecological Memory in Microbial Communities

How does sequential enrichment on lignin, with or without labile carbon supplementation, shape bacterial and fungal comm...

ENIGMA Carbon Census 1

For 83 groundwater- and necromass-derived carbon compounds proposed for community enrichment and isolate phenotyping, wh...

Caulobacter Fur–Lipid A Loss

Why does inactivation of *fur* (the ferric uptake regulator) permit the loss of lipid A in *Caulobacter crescentus*, whe...

BERDL Data Atlas — Inventory, Topic Map, and Cross-Reference Synergies

What data is available in BERDL (across tenants, agencies, and programs), what biological topics does it cover, and wher...

Annotation-Gap Discovery via Phenotype-Fitness-Pangenome-Gapfilling Integration

Can we systematically identify and resolve metabolic annotation gaps in bacterial genomes by integrating experimental gr...

Soil Microbial Dark Matter: Genomic Frontiers and the Clay Shield Null Result

Soil Metal Concentrations Drive Functional Gene Shifts in the Environmental Microbiome

Which microbial functional gene categories (COGs) are significantly associated with soil metal concentrations at global ...

Metal Resistance Ecology: Phylogenetic Conservation vs. Environmental Selection

Do metal-resistance functions in the global microbiome reflect phylogenetic constraint (conserved in lineages) or enviro...

Lanthanide Methylotrophy Atlas

Across BERDL's 293K-genome pangenome, what is the phylogenomic distribution and cassette-completeness of the lanthanide-...

Prophage-AMR Co-mobilization Atlas

At pangenome scale (293K genomes, 27K species), are antibiotic resistance genes preferentially located within or adjacen...

Self-Sufficiency, Anaerobic Toolkit, and Cultivation Bias in Clay-Confined Cultured Bacterial Genomes

Do BERDL's cultured bacterial genomes from clay-confined deep-subsurface environments recapitulate the genomic signature...

Subsurface Bacillota_B Specialization

Within the Bacillota_B phylum (Desulfosporosinus, BRH-c8a Peptococcaceae, BRH-c4a Desulfotomaculales, and other obligate...

Gene Function Ecological Agora

Across the prokaryotic tree (GTDB r214; 293,059 genomes / 27,690 species), build a multi-resolution **innovation + acqui...

Gene-Resolution Metal Cross-Resistance Across Diverse Bacteria

Is the genetic architecture of metal cross-resistance conserved across phylogenetically diverse bacteria, or is it rewir...

Ecotype Functional Differentiation

Do gene-content ecotypes within bacterial species differ in their COG functional profiles, showing differentiation in ad...

Metagenome-Prioritized Phage Cocktails for Crohn's Disease and IBD

Which bacterial pathobionts are **enriched, ubiquitous, and non-protective** in IBD, UC, and Crohn's disease patients — ...

Plant Microbiome Ecotypes

What is the genomic basis for plant-microbe associations across different plant compartments (rhizosphere, root, phyllos...

Genotype × Condition → Phenotype Prediction from ENIGMA Growth Curves

Can we predict bacterial growth phenotype — at multiple resolutions from binary growth through continuous kinetics to co...

Acinetobacter baylyi ADP1 Data Explorer

What is the scope and structure of a comprehensive ADP1 database, and how do its annotations, metabolic models, and phen...

SSO Subsurface Community Ecology — Spatial Structure, Functional Gradients, and Hydrogeological Drivers

Does 16S community similarity across the 9 ENIGMA SSO wells (3x3 grid, ~4 m span at Oak Ridge) recapitulate the spatial ...

Carbon Source Utilization Predicts Ecology and Lifestyle in Pseudomonas

Among free-living *Pseudomonas* clades, does the carbon source utilization profile predict the soil ecosystem type from ...

CF Protective Microbiome Formulation Design

Can we build a multi-criterion framework that explains measured *P. aeruginosa* PA14 inhibition from metabolic competiti...

PGP Gene Distribution Across Environments & Pangenomes

Does environmental selection shape the distribution of plant growth-promoting (PGP) bacterial genes across the BERDL pan...

Environmental Resistome at Pangenome Scale

Do antimicrobial resistance gene profiles differ between ecological niches across 27,000 bacterial species? Using 83K AM...

AMR Co-Fitness Support Networks

What genes are co-regulated with antimicrobial resistance (AMR) genes across growth conditions, and do these "support ne...

Counter Ion Effects on Metal Fitness Measurements

When bacteria are exposed to metal salts (CoCl₂, NiCl₂, CuCl₂), how much of the observed fitness effect is caused by the...

Field vs Lab Gene Importance in *Desulfovibrio vulgaris* Hildenborough

Which genes matter for survival under environmentally-realistic conditions but appear dispensable in the lab, and vice v...

Metabolic Capability vs Metabolic Dependency

Just because a bacterium's genome encodes a complete metabolic pathway (metabolic *capability*), does the organism actua...

Within-Species AMR Strain Variation

Within a species, how does the AMR repertoire vary between strains, and what drives that variation?

Pan-Bacterial Metal Fitness Atlas

Across diverse bacteria subjected to genome-wide fitness profiling under metal stress, what is the genetic architecture ...

The Pan-Bacterial Essential Metabolome

Which biochemical reactions are universally essential across bacteria, and what does the essential metabolome reveal abo...

ADP1 Deletion Collection Phenotype Analysis

What is the condition-dependent structure of gene essentiality in *Acinetobacter baylyi* ADP1, as revealed by the de Ber...

Pan-Bacterial AMR Gene Landscape

What is the distribution, conservation, phylogenetic structure, functional context, and environmental association of ant...

Contamination Gradient vs Functional Potential in ENIGMA Communities

Do high-contamination Oak Ridge groundwater communities show enrichment for taxa with higher inferred stress-related fun...

Ecotype Correlation Analysis

What drives gene content similarity between bacterial genomes: environmental similarity or phylogenetic relatedness?

Web of Microbes Data Explorer

What does the `kescience_webofmicrobes` exometabolomics collection contain, which organisms overlap with the Fitness Bro...

Ecotype Reanalysis: Environmental-Only Samples

Does the environment effect on gene content become stronger when analysis is restricted to genuinely environmental sampl...

Condition-Specific Respiratory Chain Wiring in ADP1

How is *Acinetobacter baylyi* ADP1's branched respiratory chain wired across carbon sources — which NADH dehydrogenases ...

Conservation vs Fitness -- Linking FB Genes to Pangenome Clusters

Are essential genes preferentially conserved in the core genome, and what functional categories distinguish essential-co...

Temporal Core Genome Dynamics

How does core genome composition change over sampling time, and do genes transition in and out of core status?

BacDive Phenotype Signatures of Metal Tolerance

Can BacDive-measured bacterial phenotypes (Gram stain, oxygen tolerance, metabolite utilization, enzyme activities) pred...

COG Functional Category Analysis

How do COG functional category distributions differ across core, auxiliary, and novel genes in bacterial pangenomes?

The 5,526 Costly + Dispensable Genes

What characterizes genes that are simultaneously burdensome (fitness improves when deleted) and not conserved in the pan...

AlphaEarth Embeddings, Geography & Environment Explorer

What do AlphaEarth environmental embeddings capture, and how do they relate to geographic coordinates and NCBI environme...

Aromatic Catabolism Support Network in ADP1

Why does aromatic catabolism in *Acinetobacter baylyi* ADP1 require Complex I (NADH dehydrogenase), iron acquisition, an...

Metabolic Capability vs Dependency

When a bacterium's genome encodes a complete biosynthetic or catabolic pathway, does the organism actually depend on it?...

Co-fitness Predicts Co-inheritance in Bacterial Pangenomes

Do genes with correlated fitness profiles (co-fit) tend to co-occur in the same genomes across a species' pangenome? Doe...

Metabolic Consistency of Pseudomonas FW300-N2E3

For *Pseudomonas fluorescens* FW300-N2E3 (ENIGMA groundwater isolate), how consistent are exometabolomic outputs (Web of...

Pan-bacterial Fitness Modules via Independent Component Analysis

Can we decompose RB-TnSeq fitness compendia into latent functional modules via robust ICA, align them across organisms u...

BacDive Isolation Environment × Metal Tolerance Prediction

Do bacteria isolated from metal-contaminated environments have higher predicted metal tolerance scores than bacteria fro...

Pangenome Openness Analysis

Do open pangenomes show different patterns of environmental vs phylogenetic effects compared to closed pangenomes?

Fitness Cost of Antimicrobial Resistance Genes

Do antimicrobial resistance (AMR) genes impose a fitness cost in the absence of antibiotic selection pressure? Using gen...

Truly Dark Genes — What Remains Unknown After Modern Annotation?

Among the ~6,400 Fitness Browser genes that remain functionally unannotated even after bakta v1.12.0 reannotation, what ...

Functional Dark Matter — Experimentally Prioritized Novel Genetic Systems

Which genes of unknown function across 48 bacteria have strong fitness phenotypes, and can biogeographic patterns, pathw...

SNIPE Defense System: Prevalence and Taxonomic Distribution in the BERDL Pangenome

How prevalent are SNIPE (Surface-associated Nuclease Inhibiting Phage Entry) homologues across the 293K-genome BERDL pan...

Community Metabolic Ecology via NMDC × Pangenome Integration

Do the GapMind-predicted pathway completeness profiles of community resident taxa predict or correlate with observed met...

Prophage Gene Modules and Terminase-Defined Lineages Across Bacterial Phylogeny and Environmental Gradients

How are prophage gene modules and terminase-defined prophage lineages distributed across bacterial phylogeny and environ...

Polyhydroxybutyrate Granule Formation Pathways: Distribution Across Clades and Environmental Selection

How are polyhydroxybutyrate (PHB) granule-forming pathways distributed across bacterial clades and environments, and doe...

Openness vs Functional Composition

Do species with open pangenomes show different COG functional enrichment patterns than species with closed pangenomes?

Antibiotic Resistance Hotspots in Microbial Pangenomes

Which microbial species and ecological environments show the highest concentration of antibiotic resistance genes, and c...

Pangenome Openness, Metabolic Pathways, and Biogeography

Do pangenome characteristics (open vs. closed) correlate with metabolic pathway diversity and biogeographic distribution...

Pangenome Openness, Metabolic Pathways, and Phylogenetic Distances

How do pangenome characteristics (open vs. closed) correlate with metabolic pathway completeness, phylogenetic distances...

Atlas Pages

meta

Claims

Entry point for reusable evidence-backed statements that can be cited across topics, directions, hypotheses, and data products.

claim

Metal-specific genes remain core-enriched

Metal-specific genes are functionally distinct from general stress genes but remain enriched in the core genome.

claim

Lab fitness can predict field ecology

Several projects suggest that lab-measured fitness signals can align with environmental abundance or isolation context when validation data are available.

claim

AMR mechanism composition is environment-structured

AMR mechanisms differ by environment, making resistance ecology a field and context problem rather than only a clinical annotation problem.

claim

Ecotype analyses need rigor gates before translation

Ecotype-derived target lists can collapse under leakage, confound, and independent-evidence checks, so translation requires explicit gates.

claim

Lanthanide-dependent methylotrophy is widespread and soil-linked

XoxF markers are far more common than canonical MxaF markers across the BERDL pangenome, with strong soil/sediment enrichment and important marker-calibration caveats.

claim

Pangenome openness shapes functional opportunity

Pangenome openness and gene-content class affect which functions are stable, variable, mobile, or available for niche adaptation.

claim

Prophage density predicts AMR repertoire breadth

Pangenome-scale prophage marker density is a strong species-level predictor of AMR breadth, while gene-level AMR-prophage proximity is weaker and threshold-sensitive.

claim

Metal type diversity predicts ecological niche breadth

Genus-level metal resistance type diversity predicts broader ecological niche breadth after phylogenetic control, while total AMR burden is less informative.

conflict

Ecotype labels versus translational leakage

Ecotype labels are reusable stratification products, but translational target lists can collapse when labels and outcomes share leaked or confounded features.

conflict

Metal specificity versus general stress

Metal fitness hits include both specific metal biology and broad stress response, so engineering targets need specificity filters and counter-ion controls.

conflict

Metal-AMR co-selection readiness

BERIL has the pieces to test metal-AMR co-selection, but the current evidence is a strong opportunity rather than a resolved result.

meta

BERDL Data Atlas

Entry point for BERDL tenants, collections, data types, derived products, join recipes, reuse patterns, and missing complementary data.

meta

Derived Data Reuse Graph

Canonical Atlas page for project-to-project reuse, derived products, output artifacts, consumers, and review routes.

data tenant

KBase Tenant and Foundational Genome Data

KBase-provided genome, pangenome, phenotype, and ModelSEED resources that form the core comparative-genomics substrate.

data tenant

KEScience Tenant and Fitness Evidence

Fitness Browser and related KEScience resources that provide experiment-backed genotype-to-phenotype evidence.

data tenant

ENIGMA Tenant and Contaminated-Site Context

ENIGMA CORAL and related contaminated-site data that connect microbial genomics and communities to DOE field environments.

data collection

Pangenome Collection

Genome and pangenome tables for comparative genomics, conservation, openness, annotation, and cross-project derived products.

data collection

Fitness Browser Collection

RB-TnSeq fitness evidence used to validate gene function, stress response, essentiality, cofitness, and pathway dependency.

data type

Genomes and Pangenomes

Data type lens for genome metadata, pangenome structure, annotations, and species-level comparative genomics.

data type

Fitness Phenotypes

Data type lens for RB-TnSeq phenotypes, condition-specific gene effects, cofitness, essentiality, and pathway dependency.

data type

Genes, Proteins, and Annotations

Gene and protein identifiers, functional annotation, literature coverage, orthology, and controlled vocabulary layers used to connect raw genomes to interpretable biology.

data type

Metabolism, Biochemistry, and Pathways

Biochemical reference data and predicted or measured metabolic capabilities that support pathway-level interpretation and community design.

data type

Environment, Geochemistry, and Ecology

Environmental samples, coordinates, geochemistry, sample metadata, and ecology-facing observations used to validate laboratory predictions in the field.

data type

Multi-Omics, Embeddings, and Molecular Profiles

Metabolomics, proteomics, trait profiles, embeddings, and other matrix-style summaries that create reusable sample or organism representations.

data type

Phage, Mobile Elements, and Defense

Host-specific phage browsers, pathogen genome views, and mobile-element signals used to reason about host range, defense, resistance, and engineered interventions.

derived product

Metal Tolerance Scores

Reusable species and gene-level metal tolerance signals derived from metal fitness projects and environmental validation work.

derived product

Ecotype Assignments

Reusable within-species or community ecotype labels that support environmental validation, microbiome stratification, and downstream hypothesis tests.

derived product

Environment Harmonization Labels

Reusable environment category and coordinate-quality labels that make cross-collection ecology joins safer.

derived product

Dark Gene Prioritization Tables

Reusable ranked dark-gene candidates, covering sets, and experiment plans derived from fitness, pangenome, annotation, and ecology evidence.

derived product

Functional Innovation KO Atlas

Reusable clade-level functional innovation and acquisition-depth outputs from the ecological agora project.

derived product

AMR Fitness Profiles

Reusable AMR mechanism, conservation, environment, and fitness-cost signals for resistance ecology questions.

derived product

Pangenome Openness Metrics

Reusable openness and conservation metrics that connect gene-content architecture to function and ecology.

join recipe

Genome-Fitness-Pangenome Join

Reusable join pattern that connects gene families, genome conservation, and RB-TnSeq fitness phenotypes.

derived product

CF Formulation Scores

Reusable formulation scoring outputs for CF airway community design and Pseudomonas competition analysis.

data gap

Rare Earth Fitness Data Gap

Rare-earth-element fitness experiments appear absent, making cross-metal inference a prediction task rather than validated biology.

meta

Research Directions

Entry point for high-value research opportunities synthesized from existing BERIL evidence, data products, and gaps.

direction

Gene targets for critical-mineral bioprocessing

Use metal-specific fitness signals, annotations, modules, and structures to prioritize genes for bioleaching and biorecovery.

direction

Metal-AMR co-selection at contaminated sites

Test whether metal contamination selects for AMR genes, mechanisms, or support networks across DOE-relevant environments.

direction

Rare-earth gene discovery via cross-metal inference

Use cross-metal response structure to rank candidates for first rare-earth-element fitness experiments.

direction

Fitness-validated community design

Design microbial communities using tolerance, metabolic capability, measured dependency, and risk annotations.

direction

Derived data product catalog

Make high-value project outputs discoverable, reusable, and reviewable as first-class Atlas data objects.

meta

Testable Hypotheses

Entry point for concrete hypotheses that can become projects, analyses, or experiments.

hypothesis

Bakta reannotation resolves novel metal families

Updated Bakta annotations explain a subset of previously novel metal-important gene families.

hypothesis

Structures support metal-binding or transport roles

Top metal-specific candidates have predicted structures or topology consistent with metal binding, transport, or regulation.

hypothesis

Metal contamination co-selects AMR mechanisms

Metal-contaminated environments carry higher AMR burden or different AMR mechanism composition after controlling for taxonomy and site context.

hypothesis

Metal tolerance scores predict field isolation context

Species with higher metal tolerance scores are enriched in metal-associated isolation or contaminated-site contexts.

hypothesis

Open pangenomes carry broader pathway diversity

Species with more open pangenomes have broader metabolic pathway repertoires after controlling for genome count and taxonomy.

hypothesis

Derived product reuse predicts observatory value

Derived products reused across projects are better indicators of observatory value than raw project count alone.

meta

Atlas Review Queue

Maintainer queue for deepening flagship topics, promoting reusable products, and resolving Atlas tensions.

method

New Project Integration Pattern

Procedure for integrating newly completed projects into the BERIL Atlas without turning topic pages into project-summary dumps.

opportunity

Rare-Earth RB-TnSeq Design

Design the first rare-earth fitness experiment by ranking candidate genes from cross-metal specificity, conservation, annotation, and structure evidence.

opportunity

Metal-AMR Site Co-Selection Analysis

Test whether metal contamination at BER-relevant sites co-selects antibiotic resistance mechanisms using metal fitness, AMR profiles, and environmental metadata.

opportunity

Ecotype Label Validation Benchmark

Build a benchmark that tests whether ecotype labels survive stricter metadata, batch, and holdout validation.

opportunity

Functional Innovation KO Atlas Reuse Test

Test whether the Functional Innovation KO Atlas helps explain pangenome, pathway, or plant-microbiome signals beyond generic annotation summaries.

opportunity

Dark Gene Structure Prioritization

Prioritize dark gene families for mechanistic review by joining fitness, cofitness, annotation novelty, and AlphaFold structure signals.

opportunity

Pangenome Openness Confounder Audit

Audit whether openness-function relationships remain after controlling for taxonomy, genome quality, sampling density, and annotation completeness.

opportunity

Phage Host-Range Reuse Map

Map where phage and mobile-element project outputs can become reusable host-range, defense, or genome-plasticity products.

opportunity

Plant Microbiome Function Validation

Validate whether plant microbiome functional signals persist across ecotype labels, pangenome context, and environmental metadata.

opportunity

Derived Product Readiness Burn-Down

Review candidate and promoted derived products to close missing consumers, artifacts, caveats, and review routes before they become default inputs.

person

Expertise Map

Topic-to-reviewer routing scaffold based on project ownership and source provenance.

atlas

BERIL Atlas

Entry point for the BERIL Atlas over projects, data, claims, directions, hypotheses, methods, and contributor provenance.

meta

Topic Landscape

Entry point for broad science and application syntheses that connect BERIL projects to claims, directions, hypotheses, and data.

topic

Critical Minerals and Metal Biology

Progressive synthesis of metal fitness, tolerance, validation, and critical-mineral research opportunities.

topic

AMR, Resistance Ecology, and Co-selection

Synthesis of AMR gene distribution, fitness cost, cofitness support networks, environment structure, and metal co-selection opportunities.

topic

Pangenome Architecture and Gene-Content Evolution

Cross-project synthesis of pangenome openness, core/accessory structure, functional composition, conservation, and gene-content tradeoffs.

topic

Fitness-Validated Gene Function

Synthesis of essential genes, metabolic dependency, ICA modules, dark genes, and functional annotation repair through fitness evidence.

topic

Microbial Ecotypes, Environment, and Field Validation

Synthesis of species-level ecotypes, environmental embeddings, lab-field validation, ENIGMA ecology, and metadata limitations.

topic

Host Microbiome Translation

Synthesis of IBD phage targeting, formulation design, metabolomics caveats, patient stratification, and intervention cost accounting.

topic

Plant Microbiome Function and Agriculture

Synthesis of plant-associated microbial function, beneficial/pathogenic duality, compartment structure, PGP markers, and pangenome ecology.

topic

Mobile Elements, Phage, and Genome Plasticity

Synthesis of phage ecology, prophage signals, defense systems, mobile-element gene flow, and intervention relevance.

topic

Metabolic Capability, Dependency, and Community Design

Synthesis of GapMind capability, fitness dependency, metabolic models, community ecology, and design-ready derived data.

Start Exploring

Access the full Pangenome Collection data through BERDL JupyterHub.