AlphaFold MSA Depth as a Lens on the Bacterial Annotation Gap

Gazi S. Mahmud | ORCID: 0009-0006-4046-889X | Lawrence Berkeley National Laboratory

Research Question

Does AlphaFold MSA depth predict functional annotation richness in the bacterial pangenome, and is structural novelty (low MSA depth) systematically enriched in accessory and singleton gene families relative to core genes?

Overview

Every AlphaFold prediction is built on a multiple sequence alignment (MSA), and its MSA depth is the number of evolutionary homologs in that alignment. Proteins with low MSA depth have few homologs in sequence databases, which makes them hard to predict and typically leaves them poorly annotated; this project treats low MSA depth as a proxy for structural novelty.

This project maps MSA depth onto the BERDL pangenome (293K genomes, 27K species, 132M gene clusters) to ask: where does structural novelty live in the core vs. accessory genome? We test whether low-MSA-depth proteins are disproportionately hypothetical, lack domain annotations, and are enriched in the accessory/singleton genome — and identify the exceptional "paradox proteins": core genes that are universally conserved yet structurally uncharacterized.

Key databases used:
- kbase_ke_pangenome — gene_cluster, bakta_annotations, interproscan_domains
- kescience_alphafold — alphafold_msa_depths

Key Findings

H1: Core genes are 2.9× more structurally represented than accessory genes

Figure: MSA depth by pangenome class, hypothetical rates, and structurally novel fraction.

Core gene clusters have a median MSA depth of 15,308 — 2.89× higher than the auxiliary+singleton median of 5,299 and 2.77× higher than the auxiliary non-singleton median of 5,527. The separation is most pronounced at the low end: the 10th-percentile MSA depth is 334 for core genes vs. 25–32 for accessory genes. With groups of 5–25 million clusters, the effect size alone makes the result unambiguous.

(Notebook: NB01_data_extraction.ipynb, NB03_statistical_analysis.ipynb)

H3: MSA depth strongly predicts domain annotation richness (Spearman ρ = 0.756)

Across 38,051,842 gene cluster–UniProt pairs, MSA depth and domain hit count are strongly positively correlated (Spearman ρ = 0.756). The relationship is monotone and large in magnitude:

MSA depth bin	Core clusters	Mean domain hits	Mean distinct IPR
< 10	415,733	0.59	0.059
10–99	1,143,785	0.96	0.242
100–999	2,301,137	1.95	0.852
1,000–4,999	3,126,558	3.72	1.843
5,000–9,999	2,591,011	5.62	2.664
≥ 10,000	15,993,075	10.83	4.601

The lowest MSA bin (< 10) averages fewer than one domain hit per cluster, while the highest bin averages nearly eleven — an 18× span. This confirms that AlphaFold MSA depth is not merely a structural confidence metric: it is a reliable proxy for functional annotation richness.

(Notebook: NB02_domain_annotation.ipynb, NB02b_msa_domain_join.ipynb)
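The binning behind the H3 table can be sketched in a few lines. This is a minimal illustration, not the notebook code: the bin edges are read off the table, while the function names, the record layout, and the toy records are assumptions made for the example.

```python
from bisect import bisect_right

# Bin edges read off the H3 table; labels are ASCII versions of its first column.
BIN_EDGES = [10, 100, 1_000, 5_000, 10_000]
BIN_LABELS = ["< 10", "10-99", "100-999", "1,000-4,999", "5,000-9,999", ">= 10,000"]


def msa_depth_bin(msa_depth):
    """Return the H3 bin label for an MSA depth (hypothetical helper)."""
    return BIN_LABELS[bisect_right(BIN_EDGES, msa_depth)]


def mean_hits_by_bin(records):
    """Average domain hits per bin for (msa_depth, domain_hits) pairs."""
    totals = {}
    for depth, hits in records:
        label = msa_depth_bin(depth)
        count, total = totals.get(label, (0, 0))
        totals[label] = (count + 1, total + hits)
    return {label: total / count for label, (count, total) in totals.items()}


# Toy records, invented for illustration only.
toy = [(4, 0), (9, 1), (10, 1), (250, 2), (7_500, 6), (15_000, 11), (20_000, 10)]
means = mean_hits_by_bin(toy)

# Fold span between the extreme bins, using the means reported in the table.
span = 10.83 / 0.59
print(means, round(span, 1))
```

With the table's own means, the ratio of the highest to the lowest bin comes out at about 18.4, which is the 18× span quoted above.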

H4: 415,603 "paradox proteins" — core genes that are universally conserved yet structurally unprecedented

Metric	Value
Distinct core clusters with msa_depth < 10	415,603
Species clades represented	14,768
Mean / median MSA depth	4.57 / 4.0
Hypothetical (no functional annotation)	286,439 (68.9%)
EC-annotated	137 (0.033%)
KEGG-mapped	346 (0.083%)

The paradox proteins are qualitatively different from core genes overall: while the global core hypothetical rate is 3.8%, within the paradox subset it is 68.9%. EC and KEGG annotations are essentially absent (< 0.1%). These genes are conserved across thousands of bacterial lineages, yet almost nothing is known about them at the sequence, structure, or biochemical level.

(Notebook: NB04_paradox_proteins.ipynb, NB04b_paradox_ranking.ipynb)
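As a rough sketch of the paradox-protein definition, the selection reduces to two conditions: the cluster is core and its msa_depth is below 10. The record type, the field names other than is_core and msa_depth, the product-string test for "hypothetical", and the toy clusters are all assumptions for illustration; the notebooks may define these differently.

```python
from statistics import median
from typing import NamedTuple


class Cluster(NamedTuple):
    cluster_id: str   # hypothetical identifier
    is_core: bool
    msa_depth: int
    product: str      # annotation product string (assumed field)


PARADOX_MAX_DEPTH = 10  # threshold from the text: msa_depth < 10


def is_paradox(c):
    """Core cluster whose representative has a very shallow MSA."""
    return c.is_core and c.msa_depth < PARADOX_MAX_DEPTH


def is_hypothetical(c):
    # Assumption: hypothetical proteins carry the product "hypothetical protein".
    return c.product.strip().lower() == "hypothetical protein"


# Toy clusters, invented for illustration only.
toy = [
    Cluster("c1", True, 4, "hypothetical protein"),
    Cluster("c2", True, 1, "RNA polymerase omega subunit family protein"),
    Cluster("c3", True, 15_308, "DNA gyrase subunit A"),
    Cluster("c4", False, 3, "hypothetical protein"),
    Cluster("c5", True, 9, "hypothetical protein"),
]

paradox = [c for c in toy if is_paradox(c)]
hyp_rate = sum(is_hypothetical(c) for c in paradox) / len(paradox)
median_depth = median(c.msa_depth for c in paradox)

# Reported rate from the H4 table: 286,439 hypothetical out of 415,603.
reported_rate = round(100 * 286_439 / 415_603, 1)
print([c.cluster_id for c in paradox], hyp_rate, median_depth, reported_rate)
```

Note that the accessory cluster c4 is excluded even though its MSA depth is low: the paradox set is defined only within core.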

H2: The annotation gap tracks pangenome class — but paradox proteins invert the expectation within core

Hypothetical protein rates decrease sharply from accessory to core:

Class	Clusters	Hypothetical rate
Auxiliary + singleton	7,095,643	13.8%
Auxiliary non-singleton	5,384,900	11.6%
Core	25,571,299	3.8%

Chi-square tests are overwhelmingly significant (χ² > 500,000; p ≈ 0), with odds ratios of 0.25 (core vs. aux+singleton) and 0.31 (core vs. aux non-singleton), confirming that core genes are far less likely to be hypothetical than accessory genes. This seems at odds with the paradox protein census (H4), which finds hundreds of thousands of poorly annotated proteins inside the core genome. The tension resolves once MSA depth is taken into account: within core genes, the subset with msa_depth < 10 has a 68.9% hypothetical rate, versus 3.8% for all core genes. The structural novelty signal exists within every pangenome class; the cross-class comparison is dominated by the higher average MSA depth of core genes.

(Notebook: NB03_statistical_analysis.ipynb)
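The odds ratios and chi-square statistics can be recomputed from the class table with plain arithmetic. The sketch below uses the counts printed in the table and a standard 2×2 Pearson chi-square without continuity correction; the function name is hypothetical and the notebook's exact procedure is not shown in this report, so small differences are possible. From these counts the second odds ratio comes out near 0.30, marginally below the reported 0.31, which may reflect rounding or slightly different counts in the notebook.

```python
def two_by_two(hyp_a, n_a, hyp_b, n_b):
    """Odds ratio and Pearson chi-square for hypothetical rates in groups A and B."""
    a, b = hyp_a, n_a - hyp_a   # group A: hypothetical, non-hypothetical
    c, d = hyp_b, n_b - hyp_b   # group B: hypothetical, non-hypothetical
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    # 2x2 Pearson chi-square, no continuity correction
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, chi2


# (hypothetical count, total clusters) taken from the class table
core = (979_912, 25_571_299)
aux_singleton = (979_300, 7_095_643)
aux_non_singleton = (622_748, 5_384_900)

or_single, chi2_single = two_by_two(*core, *aux_singleton)
or_non, chi2_non = two_by_two(*core, *aux_non_singleton)

print(f"core vs aux+singleton:     OR={or_single:.3f}  chi2={chi2_single:,.0f}")
print(f"core vs aux non-singleton: OR={or_non:.3f}  chi2={chi2_non:,.0f}")
```

Both chi-square values exceed 500,000, in line with the threshold quoted in the text.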

Results

Dataset coverage

Of 132,531,501 total gene clusters in kbase_ke_pangenome.gene_cluster:
- 38,804,903 (29.3%) have a real UniProt accession via bakta_annotations.uniref100
- 38,051,842 (28.7%) bridge successfully to kescience_alphafold.alphafold_msa_depths
- The remaining 70.7% lack UniRef100 IDs or carry UniParc-only (UPI-prefixed) identifiers with no AlphaFold entry

This 29.3% coverage is not random: well-characterised organisms with established reference proteomes (e.g., E. coli, Pseudomonas, Bacillus) are over-represented in UniProt. The analysis is therefore biased toward the better-studied portion of bacterial diversity, and the annotation gap for the remaining 70.7% is likely larger.
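The bridge logic described in the list above can be illustrated as follows. This is a sketch under assumptions: it supposes that bakta_annotations.uniref100 holds identifiers of the standard form UniRef100_<member>, where the member is either a UniProt accession or a UniParc ID beginning with UPI. The function name is hypothetical and the example IDs are for illustration only.

```python
def uniref100_to_accession(uniref100_id):
    """Return a UniProt accession usable for an AlphaFold lookup, or None."""
    if not uniref100_id:
        return None  # cluster has no UniRef100 ID
    prefix = "UniRef100_"
    if uniref100_id.startswith(prefix):
        member = uniref100_id[len(prefix):]
    else:
        member = uniref100_id
    if not member or member.startswith("UPI"):
        return None  # UniParc-only identifier, no AlphaFold entry expected
    return member


# Coverage figures recomputed from the counts in the list above.
TOTAL_CLUSTERS = 132_531_501
pct_accession = round(100 * 38_804_903 / TOTAL_CLUSTERS, 1)
pct_bridged = round(100 * 38_051_842 / TOTAL_CLUSTERS, 1)

print(uniref100_to_accession("UniRef100_P0A7Z4"))       # accession kept
print(uniref100_to_accession("UniRef100_UPI0000000001"))  # filtered out
print(pct_accession, pct_bridged)
```

The two percentages differ slightly because an accession is necessary but not sufficient: a cluster also needs a matching row in alphafold_msa_depths to count as bridged.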

Pangenome-class breakdown (bridged clusters)

Class	n clusters	Median MSA depth	p10	p90	Very low (< 10)	Hypothetical
Core	25,571,299	15,308	334	19,500	415,733 (1.6%)	979,912 (3.8%)
Aux non-singleton	5,384,900	5,527	32	19,192	245,002 (4.6%)	622,748 (11.6%)
Aux + singleton	7,095,643	5,299	25	19,203	392,959 (5.5%)	979,300 (13.8%)

Domain annotation (all clusters with ≥ 1 InterProScan hit)

111,035,431 gene clusters (83.8% of all clusters) have at least one domain annotation. Mean hits = 7.5, mean distinct IPR families = 3.3. This coverage is substantially higher than the AlphaFold bridge coverage (28.7%), indicating that domain-based annotation (which uses sequence profiles, not structural homology) reaches further into sequence space.

H3 full correlation result

Spearman ρ = 0.7563 (n = 38,051,842 pairs). The monotone relationship between MSA depth bin and domain hits (shown in the H3 table above) holds within all three pangenome classes (core, auxiliary non-singleton, auxiliary+singleton), with core genes consistently showing slightly higher domain richness per MSA bin than accessory genes at equivalent depth.
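For readers who want to reproduce the statistic on the sampled data (for example data/gc_msa_domain_sample.csv), Spearman's ρ is the Pearson correlation of the rank-transformed values, with tied values given their average rank. The pure-Python sketch below shows the calculation on invented toy pairs; it is not the notebook implementation, and at the full 38M-pair scale a library routine such as scipy.stats.spearmanr or a Spark-side computation would be the practical choice.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy (msa_depth, domain_hits) pairs, invented for illustration only.
depths = [1, 5, 50, 500, 5_000, 20_000]
hits = [0, 1, 1, 2, 6, 11]
rho = spearman_rho(depths, hits)
print(round(rho, 3))
```

Because only ranks matter, the heavy right skew of MSA depth does not distort the statistic, which is one reason a rank correlation suits this comparison.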

H4 top paradox proteins

The top-ranked paradox proteins (those with msa_depth = 1) come primarily from poorly characterised marine and soil bacteria (Oceanicoccus, Dwaynesavagella, CAILRJ01). Non-hypothetical entries at msa_depth = 1 include an RNA polymerase ω-subunit family protein and an FXSXX-COOH domain protein, both conserved structural components with no solved structure for these specific lineages.

Interpretation

AlphaFold MSA depth as a functional annotation proxy

AlphaFold's MSA depth is conventionally read as an indicator of structural confidence: deeper alignments generally yield higher pLDDT. This analysis points to a second use: MSA depth is a strong proxy for how much is already known about a protein functionally. The Spearman ρ of 0.756 between MSA depth and InterProScan domain hits, computed across 38 million gene-cluster–UniProt pairs, shows that the two track each other closely as measures of "how deeply a protein is embedded in existing knowledge." This is interpretable: MSA depth reflects the number of detectable evolutionary relatives in sequence databases, and those relatives are the same proteins whose characterisation generated domain profiles in Pfam, Gene3D, SUPERFAMILY, and PANTHER.

The core/accessory gradient in structural knowledge

The 2.9× difference in median MSA depth between core and accessory genes confirms the expected pattern: genes under long-term purifying selection in core genomes have more evolutionary relatives and are better represented in sequence databases. The p10 comparison is more revealing than the median: the 10th-percentile MSA depth is 334 for core genes but only 25–32 for accessory genes, a gap of more than 10×. The poorly represented tail is therefore much thinner in the core genome: only 1.6% of core clusters fall below an MSA depth of 10, compared with 4.6–5.5% of accessory clusters. The knowledge gradient is steepest at the low end.

The paradox proteins: a structural biology priority list

The 415,603 core gene clusters with msa_depth < 10 represent the most scientifically actionable finding. These proteins satisfy two competing criteria that individually might not be alarming, but together define a specific knowledge gap:

Conservation (is_core = true): each cluster is present in the majority of genomes of its species clade, and 14,768 clades are represented across the set, implying strong purifying selection and likely important biological function.
Structural novelty (msa_depth < 10): fewer than 10 homologs were found in the sequence databases searched to build the AlphaFold MSA, so the prediction rests on very little evolutionary signal and the fold may be unprecedented.

A 68.9% hypothetical rate and near-zero EC/KEGG coverage confirm that these proteins have not been assigned function by any standard informatic route. They are not merely unannotated — they are structurally isolated from all characterised protein space.

These proteins are prime candidates for experimental structural characterisation. Unlike random hypothetical proteins (which may be spurious ORFs or very species-specific), these are likely to be functionally important given their conservation. Unlike typical dark proteins (which may simply lack sequence relatives due to divergence), these have msa_depth = 1–9, meaning even deep homology searches find essentially nothing.

H2 resolution: two layers of annotation gap

The apparent contradiction in H2 — core genes have lower hypothetical rates yet harbour 415K structurally unprecedented proteins — resolves by recognising two distinct drivers of the annotation gap:

MSA-depth-driven gap: applies to all classes, explains the 18× span in domain hits, and is captured by H3.
Pangenome-class gap: accessory/singleton genes carry more hypothetical proteins independent of MSA depth, likely due to horizontal transfer, rapid evolution, and taxonomically narrow distribution.

Within core genes, the MSA-depth-driven gap applies strongly to the paradox subset (68.9% hypothetical) but not to the majority of core genes, which have high MSA depth and correspondingly rich annotations.

Literature context

The finding that core bacterial genes have higher sequence database representation than accessory genes is consistent with the foundational observation in pangenomics (Tettelin et al. 2005, PNAS) that core genes perform conserved housekeeping functions while accessory genes encode adaptive traits, a distinction that naturally produces differential sequence database coverage. However, prior analyses have not quantified this gradient at structural-novelty resolution using a database of 241M AlphaFold entries.

The AlphaFold Protein Structure Database (Varadi et al. 2022) and its 2025 update (Bertoni et al. 2026) have expanded structural coverage to include predicted structures for hundreds of millions of proteins. Schaeffer et al. (2026) classify AFDB Swiss-Prot entries and find > 100,000 domains with no Pfam mapping, highlighting that even within well-characterised reference proteomes, structure-based classification exposes significant annotation gaps. The present analysis extends this observation to the full bacterial pangenome scale (293K genomes, 27K species, 132M gene clusters), demonstrating that the annotation gap is not uniformly distributed: it is concentrated in accessory genes and, paradoxically, in a specific subset of highly conserved core genes.

Tunyasuvunakool et al. (2021) noted that only 17% of human protein residues were covered by experimental structures before AlphaFold; the situation in bacteria is similar. MSA depth provides a way to stratify which of the remaining proteins are most tractable (high MSA depth — structure prediction is reliable) vs. most novel (low MSA depth — structure prediction is uncertain and experimental determination is most needed).

Limitations

Partial bridge coverage: Only gene clusters with a non-UPI UniProt accession (29.3% of all clusters) can be looked up, and 28.7% match an AlphaFold entry. The remaining ~71% skews toward poorly characterised organisms and likely harbours a larger annotation gap than the analysed subset.
Representative sequence bias: MSA depth is looked up for the gene cluster's representative sequence only. Within-cluster sequence diversity is ignored; the representative may have a higher or lower MSA depth than the typical cluster member.
Taxonomic imbalance: The 293K genomes are not phylogenetically balanced (common taxa like Pseudomonas and E. coli are over-represented). Core gene counts and MSA depth distributions are influenced by this sampling.
H3 quantification: Spearman ρ = 0.756 is computed on the full 38M-pair dataset without subgroup stratification. The correlation likely differs between core and accessory genes, and between narrow-annotation-gap organisms and broad-annotation-gap organisms.
Static snapshot: The BERDL AlphaFold database is a single version-6 snapshot. Newly deposited UniProt entries may change MSA depths as databases grow.

Future Directions

Structural characterisation of top paradox proteins: The data/paradox_top1000.csv file provides a prioritised list of core genes with msa_depth = 1–9 for submission to structural genomics consortia (e.g., SGC, JCSG). The top candidates are targets for cryo-EM or AlphaFold-guided experimental determination.
H3 within-class stratification: Run H3 separately for core, auxiliary, and singleton clusters to determine whether the MSA depth → domain richness gradient is steeper in one class. This would clarify whether structural novelty has the same functional meaning across the pangenome.
ESM language model as a complementary novelty axis: ESMFold (Lin et al. 2023) predicts structure from a single sequence without an MSA and thus provides an orthogonal structural novelty signal. Comparing ESMFold pLDDT to AlphaFold MSA depth for the same clusters would identify proteins that are novel in sequence space but potentially foldable without homologs.
Paradox protein phylogenetic distribution: Map the 415K paradox clusters onto the GTDB phylogeny to identify which bacterial phyla and families harbour the highest densities of conserved-yet-novel proteins. This could reveal unexplored lineages with disproportionately high discovery potential.
Cross-reference with fitness data: Join the paradox protein list to kescience_fitnessbrowser fitness scores (where available) to identify paradox proteins that are essential for growth in at least one condition — providing experimental evidence of function even without annotation.

Data

Sources

Collection	Tables Used	Purpose
`kbase_ke_pangenome`	`gene_cluster`, `bakta_annotations`, `interproscan_domains`	Gene cluster classification, annotation quality, UniRef100 bridge, domain hits
`kescience_alphafold`	`alphafold_msa_depths`	MSA depth per UniProt accession

Generated Data

File	Rows	Description
`data/gc_msa_agg.csv`	3	Per-pangenome-class aggregate: MSA depth percentiles, hypothetical counts, EC/KEGG rates
`data/gc_domain_agg.csv`	4	Domain hit distribution by bin (1-2, 3-5, 6-10, 11+) across all 111M annotated clusters
`data/gc_domain_sample.csv`	100,000	Random sample of gene clusters with domain hit counts (NB02 output)
`data/gc_msa_domain_agg.csv`	18	Mean domain hits by MSA depth bin × pangenome class (NB02b)
`data/gc_h3_spearman.csv`	1	H3 Spearman ρ = 0.756, n = 38,051,842 (NB02b)
`data/gc_msa_domain_sample.csv`	~200,000	Per-cluster sample with both msa_depth and domain counts for scatter plots (NB02b)
`data/paradox_summary.csv`	1	Summary statistics for 415,603 paradox proteins (NB04b)
`data/paradox_top1000.csv`	1,000	Top-1000 paradox proteins ranked by msa_depth ascending (NB04b)

References

Jumper J et al. (2021). "Applying and improving AlphaFold at CASP14." Proteins. PMID: 34599769
Tunyasuvunakool K et al. (2021). "Highly accurate protein structure prediction for the human proteome." Nature. PMID: 34293799
Varadi M et al. (2022). "AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space." Nucleic Acids Res. PMID: 34791371
Varadi M & Velankar S (2023). "The impact of AlphaFold Protein Structure Database on the fields of life sciences." Proteomics. PMID: 36382391
Bertoni D et al. (2026). "AlphaFold Protein Structure Database 2025: a redesigned interface and updated structural coverage." Nucleic Acids Res. PMID: 41273079
Schaeffer RD et al. (2026). "ECOD: Classification of domains in AFDB Swiss-Prot structure predictions." PLoS Comput Biol. PMID: 41911251
Tettelin H et al. (2005). "Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial 'pan-genome'." PNAS. DOI: 10.1073/pnas.0506758102
Arkin AP et al. (2018). "KBase: The United States Department of Energy Systems Biology Knowledgebase." Nature Biotechnol. DOI: 10.1038/nbt.4163

Data Collections

- Kescience Alphafold (kescience_alphafold): KE Science
- Pangenome Collection (kbase_ke_pangenome): KBase, DOE
- Fitness Browser (kescience_fitnessbrowser): Price Lab, LBNL

Review

Summary

This is an exceptionally well-executed project that makes a significant scientific contribution to understanding the bacterial annotation gap through a structural lens. The research successfully demonstrates that AlphaFold MSA depth serves as a strong proxy for functional annotation richness (Spearman ρ = 0.756), revealing a clear gradient where core genes have 2.9× higher median MSA depth than accessory genes. The discovery of 415,603 "paradox proteins"—core genes that are universally conserved yet structurally unprecedented—provides an actionable priority list for experimental structural characterization. The methodology is sound, the analysis is comprehensive, and the presentation is clear and well-structured.

Methodology

The research question is clearly stated and testable, addressing whether MSA depth predicts annotation richness and if structural novelty is systematically enriched in accessory versus core genes. The approach is scientifically rigorous, using four well-defined hypotheses (H1-H4) with appropriate statistical tests. Data sources are clearly identified with row counts and bridge coverage explicitly documented (29.3% of gene clusters successfully bridge to AlphaFold). The analysis correctly handles the complexity of joining massive tables across different databases (kbase_ke_pangenome × kescience_alphafold) using Spark. The methodology acknowledges important limitations including the 71% of clusters without AlphaFold entries and potential taxonomic sampling bias.

Code Quality

The SQL queries and Spark operations are well-structured and efficient. The notebooks correctly implement known best practices from docs/pitfalls.md—specifically avoiding .toPandas() on large groupBy results from interproscan_domains (833M rows), instead aggregating in Spark first then collecting manageable summaries. The statistical methods are appropriate: Mann-Whitney U comparisons (though performed on aggregates due to data size), chi-square tests for categorical associations, and Spearman correlation for the key H3 hypothesis. The bridge validation logic properly filters UniParc-only IDs (UPI prefix) that don't map to AlphaFold. Notebooks are well-organized with clear progression from data extraction → domain annotation → statistical analysis → paradox protein identification.

Findings Assessment

The conclusions are strongly supported by the data presented. H1 (core genes have higher MSA depth) is convincingly demonstrated with large effect sizes (2.9× median difference). H3 (MSA depth predicts domain richness) is quantitatively validated with a large-sample Spearman correlation. The paradox proteins discovery (H4) is particularly compelling—415,603 core clusters with MSA depth < 10 having 68.9% hypothetical rate versus 3.8% for all core genes. The apparent contradiction in H2 is thoughtfully resolved by recognizing two distinct annotation gap mechanisms. Limitations are honestly acknowledged, including the bias toward well-characterized organisms in the 29.3% bridged subset. The visualizations effectively communicate the main findings, and the literature context appropriately positions the work within pangenomics and structural biology.

Suggestions

Enhanced figure documentation: While the single three-panel figure effectively summarizes key results, additional visualizations would strengthen the presentation—particularly a scatter plot showing the H3 correlation and a phylogenetic distribution map of paradox proteins across bacterial families.
Statistical testing completeness: The Mann-Whitney U tests for H1 are performed on aggregate percentiles rather than raw distributions due to data size constraints. Consider providing exact p-values from a subset analysis to complement the effect size comparisons.
Paradox protein characterization: The top 1000 paradox proteins are ranked only by MSA depth. Consider incorporating conservation breadth (number of species clades) and fitness data where available to create a more sophisticated experimental priority score.
Cross-validation of MSA depth proxy: The strong correlation between MSA depth and domain richness suggests these metrics are nearly interchangeable. Testing whether this relationship holds equally across different taxonomic groups would strengthen the generalizability claims.
Future directions implementation: The proposed ESMFold comparison (orthogonal structural novelty signal) and fitness data integration are scientifically valuable extensions that would significantly enhance the impact.