06 Complementarity
Jupyter notebook from the Plant Microbiome Ecotypes project.
NB06: Microbe-Microbe Co-occurrence & Complementarity¶
Tests H3 (metabolic complementarity): do co-occurring plant-associated genera exhibit greater metabolic complementarity than random genus pairs?
Steps:
- Query NMDC
kraken_goldfor genus-level abundance profiles in soil/rhizosphere samples - Map NMDC genera to GTDB species from NB01
- Filter to soil/rhizosphere NMDC samples via
study_tableandomics_files_table - Identify co-occurring genera per sample (>1% relative abundance)
- Query GapMind pathway completeness for matched genera
- Compute pairwise metabolic complementarity scores
- PGP-pathogen co-occurrence/exclusion analysis (C-score)
- Permutation null model (1000 random genus-pair draws)
- BacDive metabolite utilization cross-validation
- Network visualization and output
Requires: Spark (on BERDL JupyterHub)
Outputs: data/complementarity_network.csv, figures/complementarity_heatmap.png, figures/guild_network.png
import os, re, warnings
from berdl_notebook_utils.setup_spark_session import get_spark_session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from scipy.spatial.distance import pdist, squareform
from itertools import combinations
warnings.filterwarnings('ignore', category=FutureWarning)
_here = os.path.abspath('')
if os.path.basename(_here) == 'notebooks':
REPO = os.path.abspath(os.path.join(_here, '..', '..', '..'))
elif os.path.exists(os.path.join(_here, 'projects', 'plant_microbiome_ecotypes')):
REPO = _here
else:
REPO = os.path.abspath(os.path.join(_here, '..', '..', '..'))
PROJECT = os.path.join(REPO, 'projects', 'plant_microbiome_ecotypes')
DATA = os.path.join(PROJECT, 'data')
FIGURES = os.path.join(PROJECT, 'figures')
os.makedirs(DATA, exist_ok=True)
os.makedirs(FIGURES, exist_ok=True)
print(f'REPO: {REPO}')
print(f'DATA: {DATA}')
REPO: /home/aparkin/BERIL-research-observatory DATA: /home/aparkin/BERIL-research-observatory/projects/plant_microbiome_ecotypes/data
1. NMDC genus-level abundance profiles (Spark)¶
Query nmdc_arkin.kraken_gold for genus-level classifications.
Note: taxonomy_features is wide-format with numeric NCBI taxon-ID columns
and cannot be used for tidy-format joins (see docs/pitfalls.md).
Use kraken_gold instead -- tidy format with file_id, rank, name, abundance.
nmdc_genus_path = os.path.join(DATA, 'nmdc_genus_abundance.csv')
if os.path.exists(nmdc_genus_path):
nmdc_genus = pd.read_csv(nmdc_genus_path)
print(f'NMDC genus abundance (loaded from cache): {len(nmdc_genus):,} rows')
else:
spark = get_spark_session()
nmdc_genus = spark.sql("""
SELECT file_id, name AS taxon, abundance
FROM nmdc_arkin.kraken_gold
WHERE LOWER(rank) = 'genus'
AND abundance > 0.01
""").toPandas()
nmdc_genus.to_csv(nmdc_genus_path, index=False)
print(f'NMDC genus abundance: {len(nmdc_genus):,} rows')
print(f'Unique file_ids (samples): {nmdc_genus["file_id"].nunique():,}')
print(f'Unique genera: {nmdc_genus["taxon"].nunique():,}')
print(f'\nAbundance range: {nmdc_genus["abundance"].min():.4f} -- {nmdc_genus["abundance"].max():.4f}')
print(f'\nTop 10 most frequent genera:')
print(nmdc_genus['taxon'].value_counts().head(10).to_string())
NMDC genus abundance (loaded from cache): 49,279 rows Unique file_ids (samples): 3,579 Unique genera: 322 Abundance range: 0.0100 -- 0.9867 Top 10 most frequent genera: taxon Pseudomonas 3363 Streptomyces 3005 Burkholderia 2990 Bradyrhizobium 2955 Homo 2841 Mycobacterium 2118 Methylobacterium 2046 Rhodopseudomonas 1779 Micromonospora 1717 Mycolicibacterium 1675
2. Map NMDC genera to GTDB species¶
Load species_compartment.csv from NB01 (has genus column) and do a
simple genus-name match between NMDC taxon names and GTDB genera.
# Load NB01 species compartment data
species_comp = pd.read_csv(os.path.join(DATA, 'species_compartment.csv'))
print(f'Species compartment (NB01): {len(species_comp):,} species')
# Extract GTDB genus names (strip 'g__' prefix if present)
species_comp['genus_clean'] = species_comp['genus'].str.replace(r'^g__', '', regex=True)
gtdb_genera = set(species_comp['genus_clean'].dropna().unique())
print(f'Unique GTDB genera: {len(gtdb_genera):,}')
# Match NMDC taxon names to GTDB genus names
nmdc_genera = set(nmdc_genus['taxon'].unique())
matched_genera = nmdc_genera & gtdb_genera
print(f'\nNMDC genera: {len(nmdc_genera):,}')
print(f'Matched genera (NMDC intersect GTDB): {len(matched_genera):,}')
print(f'Match rate: {len(matched_genera) / len(nmdc_genera) * 100:.1f}%')
# Filter NMDC data to matched genera only
nmdc_matched = nmdc_genus[nmdc_genus['taxon'].isin(matched_genera)].copy()
print(f'\nNMDC rows after genus matching: {len(nmdc_matched):,} '
f'({len(nmdc_matched)/len(nmdc_genus)*100:.1f}% of original)')
Species compartment (NB01): 26,511 species Unique GTDB genera: 8,237 NMDC genera: 322 Matched genera (NMDC intersect GTDB): 260 Match rate: 80.7% NMDC rows after genus matching: 40,271 (81.7% of original)
3. Filter to soil/rhizosphere NMDC samples¶
Use omics_files_table to bridge file_id to study_id, then join
study_table for ecosystem metadata. Filter to samples where
ecosystem_type contains 'Soil' or ecosystem_subtype contains 'rhizosphere'.
# Get study metadata and file-to-study bridge from NMDC
soil_files_path = os.path.join(DATA, 'nmdc_soil_file_ids.csv')
if os.path.exists(soil_files_path):
soil_files = pd.read_csv(soil_files_path)
print(f'Soil/rhizosphere file_ids (loaded from cache): {len(soil_files):,}')
else:
spark = get_spark_session()
soil_files = spark.sql("""
SELECT DISTINCT o.file_id, o.sample_id, o.study_id,
s.ecosystem_category, s.ecosystem_type, s.ecosystem_subtype,
s.specific_ecosystem
FROM nmdc_arkin.omics_files_table o
JOIN nmdc_arkin.study_table s ON o.study_id = s.study_id
WHERE (
LOWER(s.ecosystem_type) LIKE '%soil%'
OR LOWER(s.ecosystem_subtype) LIKE '%rhizosph%'
OR LOWER(s.specific_ecosystem) LIKE '%rhizosph%'
OR LOWER(s.ecosystem_subtype) LIKE '%soil%'
)
AND o.file_id IS NOT NULL
""").toPandas()
soil_files.to_csv(soil_files_path, index=False)
print(f'Soil/rhizosphere file_ids: {len(soil_files):,}')
soil_file_ids = set(soil_files['file_id'].unique())
print(f'Unique soil/rhizosphere file_ids: {len(soil_file_ids):,}')
print(f'Unique sample_ids: {soil_files["sample_id"].nunique():,}')
print(f'\nEcosystem type distribution:')
print(soil_files['ecosystem_type'].value_counts(dropna=False).to_string())
# Filter NMDC genus data to soil/rhizosphere samples
nmdc_soil = nmdc_matched[nmdc_matched['file_id'].isin(soil_file_ids)].copy()
print(f'\nNMDC genus rows in soil/rhizosphere samples: {len(nmdc_soil):,}')
print(f'File_ids with genus data: {nmdc_soil["file_id"].nunique():,}')
print(f'Genera represented: {nmdc_soil["taxon"].nunique():,}')
Soil/rhizosphere file_ids (loaded from cache): 25,890 Unique soil/rhizosphere file_ids: 25,547 Unique sample_ids: 571 Ecosystem type distribution: ecosystem_type Soil 25890
NMDC genus rows in soil/rhizosphere samples: 4,345 File_ids with genus data: 348 Genera represented: 69
4. Build genus co-occurrence matrix¶
For each sample: identify genera at >1% relative abundance. Build a co-occurrence matrix counting how many samples share each genus pair.
# Identify co-occurring genera per sample (already filtered to >1% in query)
sample_genera = nmdc_soil.groupby('file_id')['taxon'].apply(set).reset_index()
sample_genera.columns = ['file_id', 'genera']
sample_genera['richness'] = sample_genera['genera'].apply(len)
# Filter to samples with at least 2 genera (needed for co-occurrence)
sample_genera = sample_genera[sample_genera['richness'] >= 2].copy()
print(f'Samples with >= 2 genera (>1% abundance): {len(sample_genera):,}')
print(f'Richness: mean={sample_genera["richness"].mean():.1f}, '
f'median={sample_genera["richness"].median():.0f}, '
f'max={sample_genera["richness"].max()}')
# Collect all genera present across samples
all_genera = sorted(set.union(*sample_genera['genera']))
genus_to_idx = {g: i for i, g in enumerate(all_genera)}
n_genera = len(all_genera)
print(f'Total genera in co-occurrence analysis: {n_genera}')
# Build co-occurrence count matrix
cooccur_matrix = np.zeros((n_genera, n_genera), dtype=int)
for _, row in sample_genera.iterrows():
genera_list = sorted(row['genera'])
for g1, g2 in combinations(genera_list, 2):
i, j = genus_to_idx[g1], genus_to_idx[g2]
cooccur_matrix[i, j] += 1
cooccur_matrix[j, i] += 1
# Also track how many samples each genus appears in
genus_sample_count = {}
for _, row in sample_genera.iterrows():
for g in row['genera']:
genus_sample_count[g] = genus_sample_count.get(g, 0) + 1
# Set diagonal to sample count
for g, count in genus_sample_count.items():
cooccur_matrix[genus_to_idx[g], genus_to_idx[g]] = count
print(f'\nCo-occurrence matrix: {cooccur_matrix.shape}')
print(f'Total pairwise co-occurrences: {(cooccur_matrix.sum() - np.diag(cooccur_matrix).sum()) // 2:,}')
# Top co-occurring pairs
pairs = []
for i in range(n_genera):
for j in range(i+1, n_genera):
if cooccur_matrix[i, j] > 0:
pairs.append((all_genera[i], all_genera[j], cooccur_matrix[i, j]))
pairs_df = pd.DataFrame(pairs, columns=['genus_A', 'genus_B', 'n_co_samples'])
pairs_df = pairs_df.sort_values('n_co_samples', ascending=False)
print(f'\nTotal genus pairs with co-occurrence: {len(pairs_df):,}')
print(f'\nTop 15 co-occurring pairs:')
print(pairs_df.head(15).to_string(index=False))
Samples with >= 2 genera (>1% abundance): 348 Richness: mean=12.5, median=12, max=20 Total genera in co-occurrence analysis: 69 Co-occurrence matrix: (69, 69) Total pairwise co-occurrences: 26,219
Total genus pairs with co-occurrence: 1,048
Top 15 co-occurring pairs:
genus_A genus_B n_co_samples
Bradyrhizobium Streptomyces 343
Bradyrhizobium Pseudomonas 331
Pseudomonas Streptomyces 328
Bradyrhizobium Burkholderia 311
Burkholderia Streptomyces 309
Burkholderia Pseudomonas 305
Mycobacterium Streptomyces 269
Bradyrhizobium Mycobacterium 269
Mycobacterium Pseudomonas 268
Bradyrhizobium Rhodopseudomonas 263
Burkholderia Mycobacterium 261
Rhodopseudomonas Streptomyces 261
Pseudomonas Rhodopseudomonas 258
Burkholderia Micromonospora 247
Micromonospora Pseudomonas 247
5. GapMind pathway complementarity¶
Aggregate GapMind pathway completeness from species to genus level.
Approach: Load the cached gapmind_plant_species.csv (species-level pathway
completeness from the Spark query in NB05) and species_compartment.csv (species-to-genus
mapping from NB01). Join on gtdb_species_clade_id, strip the g__ prefix from
genus names (to match NMDC genera), and aggregate: for each genus x pathway,
take MAX(complete) so that a genus "has" a pathway if any constituent species does.
gapmind_path = os.path.join(DATA, 'gapmind_genus_pathways.csv')
if os.path.exists(gapmind_path):
gapmind_genus = pd.read_csv(gapmind_path)
print(f'GapMind genus pathways (loaded from cache): {len(gapmind_genus):,} rows')
else:
# Load species-level GapMind data (cached from earlier Spark query)
gapmind_species = pd.read_csv(os.path.join(DATA, 'gapmind_plant_species.csv'))
print(f'GapMind species-level rows: {len(gapmind_species):,}')
# Load species-to-genus mapping from NB01
sp_comp = pd.read_csv(os.path.join(DATA, 'species_compartment.csv'),
usecols=['gtdb_species_clade_id', 'genus'])
# Join to get genus for each species
gapmind_with_genus = gapmind_species.merge(
sp_comp, on='gtdb_species_clade_id', how='inner'
)
print(f'After joining with species_compartment: {len(gapmind_with_genus):,} rows')
# Strip g__ prefix so genus names match NMDC plain names (e.g. 'Bacillus')
gapmind_with_genus['genus'] = (
gapmind_with_genus['genus'].str.replace(r'^g__', '', regex=True)
)
# Aggregate to genus level: MAX completeness per genus x pathway
gapmind_genus = (
gapmind_with_genus
.groupby(['genus', 'pathway', 'metabolic_category'], as_index=False)
.agg(complete=('complete', 'max'))
)
gapmind_genus.to_csv(gapmind_path, index=False)
print(f'GapMind genus pathways (aggregated): {len(gapmind_genus):,} rows')
# The genus column already has clean names (no g__ prefix).
# Create genus_clean as alias for downstream compatibility.
gapmind_genus['genus_clean'] = gapmind_genus['genus']
print(f'Unique genera in GapMind: {gapmind_genus["genus_clean"].nunique():,}')
print(f'Unique pathways: {gapmind_genus["pathway"].nunique():,}')
print(f'\nMetabolic categories:')
print(gapmind_genus['metabolic_category'].value_counts().to_string())
print(f'\nCompleteness rate: {gapmind_genus["complete"].mean()*100:.1f}% of genus-pathway pairs')
GapMind species-level rows: 2,213,340
After joining with species_compartment: 2,119,082 rows
GapMind genus pathways (aggregated): 658,712 rows Unique genera in GapMind: 8,237 Unique pathways: 80 Metabolic categories: metabolic_category carbon 510446 aa 148266 Completeness rate: 39.6% of genus-pathway pairs
# Build genus x pathway completeness matrix for co-occurring genera
cooccur_genera_set = set(all_genera)
gapmind_cooccur = gapmind_genus[gapmind_genus['genus_clean'].isin(cooccur_genera_set)].copy()
print(f'GapMind rows for co-occurring genera: {len(gapmind_cooccur):,}')
print(f'Genera with GapMind data: {gapmind_cooccur["genus_clean"].nunique():,} / {n_genera}')
# Pivot: genus x pathway -> complete (1/0)
pathway_matrix = gapmind_cooccur.pivot_table(
index='genus_clean', columns='pathway',
values='complete', fill_value=0
)
print(f'Pathway matrix: {pathway_matrix.shape}')
# Genera with GapMind data
gapmind_genera = set(pathway_matrix.index)
print(f'\nGenera available for complementarity analysis: {len(gapmind_genera)}')
GapMind rows for co-occurring genera: 5,520 Genera with GapMind data: 69 / 69 Pathway matrix: (69, 80) Genera available for complementarity analysis: 69
# Compute pairwise metabolic complementarity scores
# For each genus pair (A, B):
# complementarity = pathways where A is complete but B is not, plus vice versa
# These represent pathways where cross-feeding or metabolic cooperation could occur
complementarity_records = []
gapmind_genera_in_cooccur = sorted(gapmind_genera & cooccur_genera_set)
for g1, g2 in combinations(gapmind_genera_in_cooccur, 2):
p1 = pathway_matrix.loc[g1].values
p2 = pathway_matrix.loc[g2].values
# Asymmetric gaps: A has pathway, B does not (potential A->B provision)
a_to_b = int(((p1 == 1) & (p2 == 0)).sum())
b_to_a = int(((p2 == 1) & (p1 == 0)).sum())
# Total complementarity = symmetric sum of asymmetric gaps
complementarity = a_to_b + b_to_a
# Also compute Jaccard distance on pathway profiles
both_complete = int(((p1 == 1) & (p2 == 1)).sum())
either_complete = int(((p1 == 1) | (p2 == 1)).sum())
jaccard = 1 - (both_complete / either_complete) if either_complete > 0 else 0
# Get co-occurrence count
i1 = genus_to_idx.get(g1)
i2 = genus_to_idx.get(g2)
n_co = cooccur_matrix[i1, i2] if (i1 is not None and i2 is not None) else 0
complementarity_records.append({
'genus_A': g1, 'genus_B': g2,
'a_to_b_gaps': a_to_b, 'b_to_a_gaps': b_to_a,
'complementarity': complementarity,
'jaccard_dist': jaccard,
'both_complete': both_complete,
'either_complete': either_complete,
'n_co_samples': n_co
})
comp_df = pd.DataFrame(complementarity_records)
print(f'Total genus pairs with complementarity scores: {len(comp_df):,}')
print(f'\nComplementarity score distribution:')
print(comp_df['complementarity'].describe().to_string())
# Separate co-occurring vs non-co-occurring pairs
cooccur_pairs = comp_df[comp_df['n_co_samples'] > 0].copy()
non_cooccur_pairs = comp_df[comp_df['n_co_samples'] == 0].copy()
print(f'\nCo-occurring pairs: {len(cooccur_pairs):,}')
print(f'Non-co-occurring pairs: {len(non_cooccur_pairs):,}')
if len(cooccur_pairs) > 0 and len(non_cooccur_pairs) > 0:
stat, pval = stats.mannwhitneyu(
cooccur_pairs['complementarity'], non_cooccur_pairs['complementarity'],
alternative='greater'
)
print(f'\nMann-Whitney U test (co-occurring > non-co-occurring):')
print(f' Co-occurring mean complementarity: {cooccur_pairs["complementarity"].mean():.2f}')
print(f' Non-co-occurring mean complementarity: {non_cooccur_pairs["complementarity"].mean():.2f}')
print(f' U = {stat:.0f}, p = {pval:.2e}')
Total genus pairs with complementarity scores: 2,346 Complementarity score distribution: count 2346.000000 mean 12.739130 std 8.504216 min 0.000000 25% 6.000000 50% 11.000000 75% 18.000000 max 43.000000 Co-occurring pairs: 1,048 Non-co-occurring pairs: 1,298 Mann-Whitney U test (co-occurring > non-co-occurring): Co-occurring mean complementarity: 10.72 Non-co-occurring mean complementarity: 14.37 U = 495024, p = 1.00e+00
6. PGP-pathogen co-occurrence/exclusion analysis¶
From NB02 species_cohort_markers.csv, classify genera as PGP-dominant,
pathogen-dominant, dual, or neutral. Then test whether PGP and pathogen
genera co-occur or exclude each other in NMDC communities using a C-score
(checkerboard score) approach.
# Load NB02 cohort assignments
cohort_path = os.path.join(DATA, 'species_cohort_markers.csv')
if os.path.exists(cohort_path):
cohort_df = pd.read_csv(cohort_path)
print(f'Species cohort markers (NB02): {len(cohort_df):,} species')
else:
print('WARNING: species_cohort_markers.csv not found. Run NB02 first.')
cohort_df = pd.DataFrame()
if len(cohort_df) > 0:
# Aggregate cohort to genus level: majority vote
cohort_df['genus_clean'] = cohort_df['genus'].str.replace(r'^g__', '', regex=True)
genus_cohort = cohort_df.groupby('genus_clean')['marker_cohort'].agg(
lambda x: x.value_counts().index[0]
).reset_index()
genus_cohort.columns = ['genus', 'guild']
# Recode for clarity
guild_map = {
'pgp_only': 'pgp_dominant',
'pathogen_only': 'pathogen_dominant',
'dual_nature': 'dual',
'neutral': 'neutral'
}
genus_cohort['guild'] = genus_cohort['guild'].map(guild_map).fillna('neutral')
print(f'\nGenus-level guild distribution:')
print(genus_cohort['guild'].value_counts().to_string())
# Map guilds to co-occurring genera
guild_lookup = dict(zip(genus_cohort['genus'], genus_cohort['guild']))
print(f'\nGenera with guild assignments: {len(guild_lookup):,}')
print(f'Of which appear in NMDC co-occurrence: '
f'{len(set(guild_lookup.keys()) & cooccur_genera_set):,}')
Species cohort markers (NB02): 25,660 species
Genus-level guild distribution: guild dual 3844 pathogen_dominant 2940 pgp_dominant 451 neutral 320 Genera with guild assignments: 7,555 Of which appear in NMDC co-occurrence: 69
# C-score (checkerboard score) analysis for PGP vs pathogen genera
# The C-score measures the tendency of species pairs to NOT co-occur.
# C_ij = (r_i - S_ij) * (r_j - S_ij) where r = row total, S = co-occurrence count
# Higher C-score = more exclusion (fewer shared sites than expected)
if len(cohort_df) > 0:
# Identify PGP and pathogen genera in our data
pgp_genera = [g for g in all_genera if guild_lookup.get(g) == 'pgp_dominant']
path_genera = [g for g in all_genera if guild_lookup.get(g) == 'pathogen_dominant']
dual_genera = [g for g in all_genera if guild_lookup.get(g) == 'dual']
neutral_genera = [g for g in all_genera if guild_lookup.get(g, 'neutral') == 'neutral']
print(f'PGP genera in NMDC data: {len(pgp_genera)}')
print(f'Pathogen genera in NMDC data: {len(path_genera)}')
print(f'Dual genera: {len(dual_genera)}')
print(f'Neutral genera: {len(neutral_genera)}')
# Compute C-score for PGP-pathogen pairs
n_samples_total = len(sample_genera)
c_scores_pgp_path = []
c_scores_within_pgp = []
c_scores_within_path = []
for g_pgp in pgp_genera:
for g_path in path_genera:
i1 = genus_to_idx.get(g_pgp)
i2 = genus_to_idx.get(g_path)
if i1 is not None and i2 is not None:
r_i = genus_sample_count.get(g_pgp, 0)
r_j = genus_sample_count.get(g_path, 0)
s_ij = cooccur_matrix[i1, i2]
c_score = (r_i - s_ij) * (r_j - s_ij)
c_scores_pgp_path.append(c_score)
# Within-PGP C-scores
for g1, g2 in combinations(pgp_genera, 2):
i1, i2 = genus_to_idx.get(g1), genus_to_idx.get(g2)
if i1 is not None and i2 is not None:
r_i = genus_sample_count.get(g1, 0)
r_j = genus_sample_count.get(g2, 0)
s_ij = cooccur_matrix[i1, i2]
c_scores_within_pgp.append((r_i - s_ij) * (r_j - s_ij))
# Within-pathogen C-scores
for g1, g2 in combinations(path_genera, 2):
i1, i2 = genus_to_idx.get(g1), genus_to_idx.get(g2)
if i1 is not None and i2 is not None:
r_i = genus_sample_count.get(g1, 0)
r_j = genus_sample_count.get(g2, 0)
s_ij = cooccur_matrix[i1, i2]
c_scores_within_path.append((r_i - s_ij) * (r_j - s_ij))
print(f'\n=== C-score Analysis ===')
if c_scores_pgp_path:
print(f'PGP-Pathogen pairs: {len(c_scores_pgp_path)}, '
f'mean C-score = {np.mean(c_scores_pgp_path):.1f}')
if c_scores_within_pgp:
print(f'Within-PGP pairs: {len(c_scores_within_pgp)}, '
f'mean C-score = {np.mean(c_scores_within_pgp):.1f}')
if c_scores_within_path:
print(f'Within-Pathogen pairs: {len(c_scores_within_path)}, '
f'mean C-score = {np.mean(c_scores_within_path):.1f}')
# Test: PGP-pathogen exclusion vs within-guild co-occurrence
within_guild = c_scores_within_pgp + c_scores_within_path
if c_scores_pgp_path and within_guild:
stat_c, pval_c = stats.mannwhitneyu(
c_scores_pgp_path, within_guild, alternative='greater'
)
print(f'\nMann-Whitney U (PGP-pathogen > within-guild C-score):')
print(f' U = {stat_c:.0f}, p = {pval_c:.2e}')
if pval_c < 0.05:
print(' --> PGP and pathogen genera show significant exclusion')
else:
print(' --> No significant exclusion between PGP and pathogen genera')
else:
print('Skipping C-score analysis (no cohort data).')
guild_lookup = {}
pgp_genera = []
path_genera = []
dual_genera = []
neutral_genera = []
PGP genera in NMDC data: 0 Pathogen genera in NMDC data: 2 Dual genera: 67 Neutral genera: 0 === C-score Analysis === Within-Pathogen pairs: 1, mean C-score = 156.0
7. Permutation null model (1000 random genus-pair draws)¶
To test whether co-occurring genera have higher complementarity than expected by chance, we draw 1000 random genus pairs from the same pool (preserving the set of genera that appear in NMDC data with GapMind coverage) and compare observed complementarity to the null distribution.
# Permutation null model
np.random.seed(42)
n_permutations = 1000
genera_with_gapmind = sorted(gapmind_genera & cooccur_genera_set)
n_gw = len(genera_with_gapmind)
# Observed complementarity for co-occurring pairs (those with both GapMind data)
observed_comp = cooccur_pairs[
cooccur_pairs['genus_A'].isin(gapmind_genera) &
cooccur_pairs['genus_B'].isin(gapmind_genera)
]['complementarity'].values
print(f'Observed co-occurring pairs with GapMind data: {len(observed_comp)}')
if len(observed_comp) > 0:
obs_mean = observed_comp.mean()
print(f'Observed mean complementarity: {obs_mean:.2f}')
# Null distribution: random genus pairs
null_means = []
n_pairs_to_draw = len(observed_comp)
for _ in range(n_permutations):
# Draw random pairs
idx1 = np.random.choice(n_gw, size=n_pairs_to_draw, replace=True)
idx2 = np.random.choice(n_gw, size=n_pairs_to_draw, replace=True)
perm_comp = []
for i1, i2 in zip(idx1, idx2):
if i1 == i2:
continue
g1 = genera_with_gapmind[i1]
g2 = genera_with_gapmind[i2]
p1 = pathway_matrix.loc[g1].values
p2 = pathway_matrix.loc[g2].values
a_to_b = int(((p1 == 1) & (p2 == 0)).sum())
b_to_a = int(((p2 == 1) & (p1 == 0)).sum())
perm_comp.append(a_to_b + b_to_a)
if perm_comp:
null_means.append(np.mean(perm_comp))
null_means = np.array(null_means)
p_perm = (null_means >= obs_mean).sum() / len(null_means)
print(f'\nNull mean complementarity: {null_means.mean():.2f} +/- {null_means.std():.2f}')
print(f'Observed mean: {obs_mean:.2f}')
print(f'Permutation p-value (one-sided, observed >= null): {p_perm:.4f}')
if p_perm < 0.05:
print('\n>>> H3 SUPPORTED: Co-occurring genera have significantly higher '
'metabolic complementarity than random genus pairs.')
else:
print('\n>>> H3 NOT SUPPORTED: No significant difference in complementarity '
'between co-occurring and random genus pairs.')
# Effect size
if null_means.std() > 0:
cohens_d = (obs_mean - null_means.mean()) / null_means.std()
print(f'Effect size (Cohen\'s d): {cohens_d:.2f}')
else:
print('No co-occurring pairs with GapMind data for permutation test.')
p_perm = np.nan
null_means = np.array([])
Observed co-occurring pairs with GapMind data: 1048 Observed mean complementarity: 10.72
Null mean complementarity: 12.74 +/- 0.27 Observed mean: 10.72 Permutation p-value (one-sided, observed >= null): 1.0000 >>> H3 NOT SUPPORTED: No significant difference in complementarity between co-occurring and random genus pairs. Effect size (Cohen's d): -7.54
# Visualization: observed vs null distribution
if len(null_means) > 0 and len(observed_comp) > 0:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: null distribution with observed value
ax = axes[0]
ax.hist(null_means, bins=40, color='#90A4AE', edgecolor='white', alpha=0.8,
label='Null distribution')
ax.axvline(obs_mean, color='#F44336', linewidth=2, linestyle='--',
label=f'Observed mean = {obs_mean:.2f}')
ax.set_xlabel('Mean complementarity score')
ax.set_ylabel('Count (permutations)')
ax.set_title(f'Permutation Test (n={n_permutations}, p={p_perm:.4f})')
ax.legend(fontsize=9)
# Right: complementarity vs co-occurrence frequency
ax = axes[1]
plot_pairs = comp_df[
comp_df['genus_A'].isin(gapmind_genera) &
comp_df['genus_B'].isin(gapmind_genera)
].copy()
cooccur_mask = plot_pairs['n_co_samples'] > 0
ax.scatter(
plot_pairs.loc[~cooccur_mask, 'n_co_samples'],
plot_pairs.loc[~cooccur_mask, 'complementarity'],
alpha=0.15, s=8, color='#BDBDBD', label='Non-co-occurring', rasterized=True
)
ax.scatter(
plot_pairs.loc[cooccur_mask, 'n_co_samples'],
plot_pairs.loc[cooccur_mask, 'complementarity'],
alpha=0.4, s=15, color='#4CAF50', label='Co-occurring', rasterized=True
)
ax.set_xlabel('Number of co-occurring samples')
ax.set_ylabel('Complementarity score')
ax.set_title('Complementarity vs. co-occurrence')
ax.legend(fontsize=9)
plt.tight_layout()
plt.savefig(os.path.join(FIGURES, 'complementarity_heatmap.png'),
dpi=150, bbox_inches='tight')
plt.show()
print('Saved figures/complementarity_heatmap.png')
else:
print('Insufficient data for complementarity visualization.')
Saved figures/complementarity_heatmap.png
8. BacDive metabolite utilization cross-validation¶
For genera involved in complementary pairs, query BacDive
metabolite_utilization to check if predicted pathway gaps align
with known metabolite utilization data.
Pitfall: BacDive utilization has four values: +, -, produced, +/-.
Filter to explicit +/- tests only (see docs/pitfalls.md).
bacdive_met_path = os.path.join(DATA, 'bacdive_metabolites_genera.csv')
if os.path.exists(bacdive_met_path):
bacdive_met = pd.read_csv(bacdive_met_path)
print(f'BacDive metabolite utilization (loaded from cache): {len(bacdive_met):,} rows')
else:
spark = get_spark_session()
# Query BacDive metabolite utilization joined with taxonomy for genus
bacdive_met = spark.sql("""
SELECT t.genus, mu.compound_name, mu.chebi_id, mu.utilization,
COUNT(*) AS n_tests
FROM kescience_bacdive.metabolite_utilization mu
JOIN kescience_bacdive.taxonomy t ON mu.bacdive_id = t.bacdive_id
WHERE mu.utilization IN ('+', '-')
AND t.genus IS NOT NULL
GROUP BY t.genus, mu.compound_name, mu.chebi_id, mu.utilization
""").toPandas()
bacdive_met.to_csv(bacdive_met_path, index=False)
print(f'BacDive metabolite utilization: {len(bacdive_met):,} rows')
print(f'Unique genera in BacDive: {bacdive_met["genus"].nunique():,}')
print(f'Unique compounds: {bacdive_met["compound_name"].nunique():,}')
print(f'\nUtilization counts: {bacdive_met["utilization"].value_counts().to_dict()}')
BacDive metabolite utilization: 137,599 rows
Unique genera in BacDive: 2,489
Unique compounds: 953
Utilization counts: {'-': 77430, '+': 60169}
# Cross-validate: for top complementary co-occurring pairs,
# check if genera with pathway gaps also lack BacDive utilization
# for related metabolites
# Map GapMind pathway names to BacDive compound names (approximate matching)
# GapMind pathways are amino acids, carbon sources; BacDive tests compounds
pathway_to_compound = {
'glucose': ['D-Glucose', 'glucose'],
'glucosamine': ['D-Glucosamine', 'glucosamine'],
'arg': ['L-Arginine', 'arginine'],
'his': ['L-Histidine', 'histidine'],
'trp': ['L-Tryptophan', 'tryptophan'],
'leu': ['L-Leucine', 'leucine'],
'ile': ['L-Isoleucine', 'isoleucine'],
'val': ['L-Valine', 'valine'],
'lys': ['L-Lysine', 'lysine'],
'phe': ['L-Phenylalanine', 'phenylalanine'],
'pro': ['L-Proline', 'proline'],
'ser': ['L-Serine', 'serine'],
'thr': ['L-Threonine', 'threonine'],
'gly': ['glycine'],
'cys': ['L-Cysteine', 'cysteine'],
'met': ['L-Methionine', 'methionine'],
'gln': ['L-Glutamine', 'glutamine'],
'asn': ['L-Asparagine', 'asparagine'],
'tyr': ['L-Tyrosine', 'tyrosine'],
'galactose': ['D-Galactose', 'galactose'],
'lactose': ['lactose'],
'trehalose': ['trehalose', 'D,D-Trehalose'],
'sucrose': ['sucrose'],
'maltose': ['maltose'],
'mannose': ['D-Mannose', 'mannose'],
'xylose': ['D-Xylose', 'xylose'],
'arabinose': ['L-Arabinose', 'arabinose'],
'citrate': ['citrate', 'citric acid'],
'pyruvate': ['pyruvate', 'pyruvic acid'],
}
# Build genus -> compound -> utilization lookup from BacDive
bd_positive = bacdive_met[bacdive_met['utilization'] == '+'].copy()
bd_genus_compounds = bd_positive.groupby('genus')['compound_name'].apply(set).to_dict()
# For top 50 complementary co-occurring pairs, check alignment
top_comp_pairs = cooccur_pairs.nlargest(50, 'complementarity')
validation_records = []
for _, pair_row in top_comp_pairs.iterrows():
g1, g2 = pair_row['genus_A'], pair_row['genus_B']
if g1 not in pathway_matrix.index or g2 not in pathway_matrix.index:
continue
p1 = pathway_matrix.loc[g1]
p2 = pathway_matrix.loc[g2]
# Find pathways where g1 has but g2 lacks
gaps_g2 = p1.index[(p1 == 1) & (p2 == 0)].tolist()
for pathway in gaps_g2:
compounds = pathway_to_compound.get(pathway, [])
if not compounds:
continue
# Check if g2 is known to NOT utilize these compounds in BacDive
bd_g2 = bd_genus_compounds.get(g2, set())
for cpd in compounds:
cpd_lower = cpd.lower()
# Check if genus can utilize (case-insensitive match)
can_use = any(cpd_lower in c.lower() for c in bd_g2)
validation_records.append({
'genus_A': g1, 'genus_B': g2,
'pathway': pathway, 'compound': cpd,
'g2_has_pathway': 0,
'g2_can_utilize_bacdive': int(can_use),
'consistent': int(not can_use) # gap + no utilization = consistent
})
if validation_records:
valid_df = pd.DataFrame(validation_records)
consistency_rate = valid_df['consistent'].mean()
print(f'BacDive cross-validation records: {len(valid_df)}')
print(f'Consistency rate (GapMind gap aligns with no BacDive utilization): '
f'{consistency_rate*100:.1f}%')
print(f'\nBreakdown by pathway:')
pathway_consistency = valid_df.groupby('pathway')['consistent'].agg(['mean', 'count'])
pathway_consistency.columns = ['consistency_rate', 'n_tests']
pathway_consistency = pathway_consistency.sort_values('n_tests', ascending=False)
print(pathway_consistency.head(15).to_string())
else:
print('No BacDive cross-validation records generated.')
valid_df = pd.DataFrame()
BacDive cross-validation records: 373
Consistency rate (GapMind gap aligns with no BacDive utilization): 83.1%
Breakdown by pathway:
consistency_rate n_tests
pathway
pyruvate 0.884615 26
his 1.000000 26
mannose 0.458333 24
citrate 1.000000 22
arabinose 0.636364 22
arg 0.863636 22
trp 0.850000 20
val 1.000000 20
xylose 0.600000 20
phe 1.000000 20
leu 1.000000 20
glucosamine 0.944444 18
ser 1.000000 16
galactose 0.937500 16
pro 0.125000 16
9. Network visualization of top complementarity relationships¶
Nodes = genera, colored by guild (PGP / pathogen / dual / neutral). Edges = complementarity score (width proportional to score). Only show top relationships for readability.
# Network visualization using matplotlib (no networkx dependency required)
# Select top 30 co-occurring pairs by complementarity for visualization
top_n = 30
top_edges = cooccur_pairs.nlargest(top_n, 'complementarity')
if len(top_edges) > 0:
# Collect unique genera in top edges
net_genera = sorted(set(top_edges['genus_A']) | set(top_edges['genus_B']))
n_nodes = len(net_genera)
genus_idx_net = {g: i for i, g in enumerate(net_genera)}
# Layout: circular
angles = np.linspace(0, 2 * np.pi, n_nodes, endpoint=False)
radius = 4.0
pos_x = radius * np.cos(angles)
pos_y = radius * np.sin(angles)
# Guild colors
guild_colors = {
'pgp_dominant': '#4CAF50',
'pathogen_dominant': '#F44336',
'dual': '#FF9800',
'neutral': '#90A4AE'
}
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
# Draw edges
max_comp = top_edges['complementarity'].max()
for _, row in top_edges.iterrows():
i1 = genus_idx_net[row['genus_A']]
i2 = genus_idx_net[row['genus_B']]
lw = 0.5 + 3.0 * (row['complementarity'] / max_comp)
alpha = 0.3 + 0.5 * (row['complementarity'] / max_comp)
ax.plot([pos_x[i1], pos_x[i2]], [pos_y[i1], pos_y[i2]],
color='#78909C', linewidth=lw, alpha=alpha, zorder=1)
# Draw nodes
for g in net_genera:
idx = genus_idx_net[g]
guild = guild_lookup.get(g, 'neutral')
color = guild_colors.get(guild, '#90A4AE')
node_size = genus_sample_count.get(g, 1)
# Scale node size
s = 80 + 15 * min(node_size, 50)
ax.scatter(pos_x[idx], pos_y[idx], s=s, c=color,
edgecolors='white', linewidth=1.5, zorder=2)
# Label
label_radius = radius + 0.6
lx = label_radius * np.cos(angles[idx])
ly = label_radius * np.sin(angles[idx])
ha = 'left' if np.cos(angles[idx]) >= 0 else 'right'
rotation = np.degrees(angles[idx])
if rotation > 90 and rotation < 270:
rotation += 180
ax.text(lx, ly, g, fontsize=7, ha=ha, va='center',
rotation=rotation, rotation_mode='anchor')
# Legend
for guild_name, color in guild_colors.items():
ax.scatter([], [], c=color, s=100, label=guild_name.replace('_', ' ').title())
ax.legend(loc='upper left', fontsize=9, title='Guild', framealpha=0.9)
ax.set_xlim(-radius - 2.5, radius + 2.5)
ax.set_ylim(-radius - 2.5, radius + 2.5)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title(f'Top {top_n} Complementarity Relationships\n'
f'(edge width ~ complementarity score, node size ~ NMDC prevalence)',
fontsize=12)
plt.tight_layout()
plt.savefig(os.path.join(FIGURES, 'guild_network.png'),
dpi=150, bbox_inches='tight')
plt.show()
print('Saved figures/guild_network.png')
else:
print('No co-occurring pairs with complementarity data for network plot.')
Saved figures/guild_network.png
10. Complementarity heatmap for top genera¶
# Heatmap: pathway completeness for top complementary genera
if len(top_edges) > 0 and len(pathway_matrix) > 0:
heatmap_genera = sorted(set(top_edges['genus_A']) | set(top_edges['genus_B']))
heatmap_genera = [g for g in heatmap_genera if g in pathway_matrix.index]
if len(heatmap_genera) >= 2:
# Select pathways with variation across these genera
sub_matrix = pathway_matrix.loc[heatmap_genera]
pathway_var = sub_matrix.var()
variable_pathways = pathway_var[pathway_var > 0].nlargest(30).index.tolist()
if variable_pathways:
plot_data = sub_matrix[variable_pathways]
# Add guild annotation for row colors
row_colors = pd.Series(
{g: guild_colors.get(guild_lookup.get(g, 'neutral'), '#90A4AE')
for g in heatmap_genera},
name='Guild'
)
fig_height = max(6, len(heatmap_genera) * 0.35)
fig_width = max(10, len(variable_pathways) * 0.35)
g = sns.clustermap(
plot_data.astype(float),
cmap='YlGn', vmin=0, vmax=1,
row_colors=row_colors,
figsize=(fig_width, fig_height),
xticklabels=True, yticklabels=True,
linewidths=0.3, linecolor='white',
cbar_kws={'label': 'Pathway complete (1) / incomplete (0)'},
dendrogram_ratio=(0.12, 0.08)
)
g.ax_heatmap.set_xlabel('GapMind Pathway', fontsize=10)
g.ax_heatmap.set_ylabel('Genus', fontsize=10)
g.ax_heatmap.tick_params(labelsize=7)
g.fig.suptitle('Pathway Completeness: Top Complementary Co-occurring Genera',
y=1.02, fontsize=12)
plt.savefig(os.path.join(FIGURES, 'complementarity_heatmap_detail.png'),
dpi=150, bbox_inches='tight')
plt.show()
print('Saved figures/complementarity_heatmap_detail.png')
else:
print('No variable pathways found for heatmap.')
else:
print('Fewer than 2 genera with GapMind data for heatmap.')
else:
print('Insufficient data for heatmap.')
Saved figures/complementarity_heatmap_detail.png
11. Save outputs¶
# Save complementarity network
# Add guild annotations to comp_df
comp_df['guild_A'] = comp_df['genus_A'].map(guild_lookup).fillna('neutral')
comp_df['guild_B'] = comp_df['genus_B'].map(guild_lookup).fillna('neutral')
comp_df.to_csv(os.path.join(DATA, 'complementarity_network.csv'), index=False)
print('=== NB06 Summary ===')
print(f'NMDC genus profiles: {nmdc_genus["file_id"].nunique():,} samples, '
f'{nmdc_genus["taxon"].nunique():,} genera')
print(f'Soil/rhizosphere samples: {nmdc_soil["file_id"].nunique():,}')
print(f'Genera in co-occurrence analysis: {n_genera}')
print(f'Co-occurring pairs (with GapMind): {len(cooccur_pairs):,}')
if len(observed_comp) > 0:
print(f'Mean complementarity (co-occurring): {obs_mean:.2f}')
print(f'Mean complementarity (null): {null_means.mean():.2f}')
print(f'Permutation p-value: {p_perm:.4f}')
if len(c_scores_pgp_path) > 0:
print(f'PGP-pathogen C-score: {np.mean(c_scores_pgp_path):.1f}')
if len(valid_df) > 0:
print(f'BacDive cross-validation consistency: {valid_df["consistent"].mean()*100:.1f}%')
print(f'\nOutputs saved to {DATA}/')
print(' - complementarity_network.csv')
print(f'\nFigures saved to {FIGURES}/')
print(' - complementarity_heatmap.png')
print(' - complementarity_heatmap_detail.png')
print(' - guild_network.png')
print(f'\nReady for NB07 (synthesis)')
=== NB06 Summary === NMDC genus profiles: 3,579 samples, 322 genera Soil/rhizosphere samples: 348 Genera in co-occurrence analysis: 69 Co-occurring pairs (with GapMind): 1,048 Mean complementarity (co-occurring): 10.72 Mean complementarity (null): 12.74 Permutation p-value: 1.0000 BacDive cross-validation consistency: 83.1% Outputs saved to /home/aparkin/BERIL-research-observatory/projects/plant_microbiome_ecotypes/data/ - complementarity_network.csv Figures saved to /home/aparkin/BERIL-research-observatory/projects/plant_microbiome_ecotypes/figures/ - complementarity_heatmap.png - complementarity_heatmap_detail.png - guild_network.png Ready for NB07 (synthesis)