Nb03 Differential Enrichment
Jupyter notebook from the Ecotype Functional Differentiation project.
NB03: Differential Enrichment Analysis¶
Goal: Test whether COG functional categories differ between ecotypes within species, and whether adaptive categories differentiate more than housekeeping categories.
Hypothesis:
- H0: Ecotypes have indistinguishable COG profiles — gene content variation is functionally random
- H1: Ecotypes differ in adaptive COG categories (V, P, G, E, Q, M, K) but NOT in core metabolic categories (J, F, H, C)
Approach:
- For each species with ≥2 ecotypes: chi-square test per COG category
- BH-FDR correction across all species × COG combinations
- Aggregate: which COG categories are most frequently differentiated?
- Test H1: compare differentiation rates of adaptive vs housekeeping categories
Inputs: data/ecotype_assignments.csv, data/ecotype_cog_profiles.csv
Outputs: Figures in figures/, summary statistics
In [1]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact, mannwhitneyu
from statsmodels.stats.multitest import multipletests
import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('Agg')
import warnings
warnings.filterwarnings('ignore')
1. Load data¶
In [2]:
ecotypes = pd.read_csv('../data/ecotype_assignments.csv')
cog_profiles = pd.read_csv('../data/ecotype_cog_profiles.csv')
clustering_stats = pd.read_csv('../data/clustering_stats.csv')
print(f"Ecotype assignments: {len(ecotypes)} genomes across {ecotypes['species'].nunique()} species")
print(f"COG profiles: {len(cog_profiles)} entries across {cog_profiles['species'].nunique()} species")
print(f"\nClustering stats:")
print(f" Mean silhouette: {clustering_stats['silhouette'].mean():.3f}")
print(f" Median silhouette: {clustering_stats['silhouette'].median():.3f}")
print(f" Mean clusters/species: {clustering_stats['n_clusters'].mean():.1f}")
Ecotype assignments: 1820 genomes across 12 species COG profiles: 894 entries across 12 species Clustering stats: Mean silhouette: 0.215 Median silhouette: 0.174 Mean clusters/species: 3.7
2. COG category definitions¶
In [3]:
# COG category descriptions
COG_NAMES = {
'J': 'Translation', 'A': 'RNA processing', 'K': 'Transcription',
'L': 'Replication/repair', 'B': 'Chromatin', 'D': 'Cell cycle',
'Y': 'Nuclear structure', 'V': 'Defense', 'T': 'Signal transduction',
'M': 'Cell wall', 'N': 'Cell motility', 'Z': 'Cytoskeleton',
'W': 'Extracellular', 'U': 'Secretion', 'O': 'PTM/chaperones',
'X': 'Mobilome', 'C': 'Energy production', 'G': 'Carbohydrate metabolism',
'E': 'Amino acid metabolism', 'F': 'Nucleotide metabolism',
'H': 'Coenzyme metabolism', 'I': 'Lipid metabolism',
'P': 'Inorganic ion transport', 'Q': 'Secondary metabolites',
'R': 'General prediction', 'S': 'Unknown function'
}
# Hypothesis categories
ADAPTIVE = {'V', 'P', 'G', 'E', 'Q', 'M', 'K'} # Expected to differ between ecotypes
HOUSEKEEPING = {'J', 'F', 'H', 'C'} # Expected to be shared
print(f"Adaptive categories ({len(ADAPTIVE)}): {sorted(ADAPTIVE)}")
print(f" = {', '.join(COG_NAMES[c] for c in sorted(ADAPTIVE))}")
print(f"\nHousekeeping categories ({len(HOUSEKEEPING)}): {sorted(HOUSEKEEPING)}")
print(f" = {', '.join(COG_NAMES[c] for c in sorted(HOUSEKEEPING))}")
Adaptive categories (7): ['E', 'G', 'K', 'M', 'P', 'Q', 'V'] = Amino acid metabolism, Carbohydrate metabolism, Transcription, Cell wall, Inorganic ion transport, Secondary metabolites, Defense Housekeeping categories (4): ['C', 'F', 'H', 'J'] = Energy production, Nucleotide metabolism, Coenzyme metabolism, Translation
3. Per-species COG differentiation tests¶
For each species with ≥2 ecotypes, test whether each COG category's proportion differs between ecotypes using chi-square tests.
In [4]:
test_results = []
species_list = cog_profiles['species'].unique()
for sp in species_list:
sp_data = cog_profiles[cog_profiles['species'] == sp]
ecotype_ids = sorted(sp_data['ecotype'].unique())
if len(ecotype_ids) < 2:
continue
# Get all COG categories present in this species
cog_cats = sorted(sp_data['COG_category'].unique())
for cog in cog_cats:
# Build contingency table: ecotype × (has_cog, not_has_cog)
counts = []
for ec in ecotype_ids:
ec_data = sp_data[sp_data['ecotype'] == ec]
cog_row = ec_data[ec_data['COG_category'] == cog]
if len(cog_row) > 0:
count = cog_row['count'].values[0]
total = cog_row['total'].values[0]
else:
count = 0
total = ec_data['total'].values[0] if len(ec_data) > 0 else 0
counts.append((count, max(total - count, 0)))
contingency = np.array(counts)
# Skip if all zeros in one column
if contingency[:, 0].sum() == 0 or contingency[:, 1].sum() == 0:
continue
# Chi-square test (or Fisher's exact for 2×2 with small counts)
try:
if len(ecotype_ids) == 2 and contingency.min() < 5:
_, p_val = fisher_exact(contingency)
test_type = 'fisher'
else:
chi2, p_val, dof, expected = chi2_contingency(contingency)
test_type = 'chi2'
except ValueError:
continue
# Effect size: max absolute difference in proportions
fractions = []
for ec in ecotype_ids:
ec_data = sp_data[sp_data['ecotype'] == ec]
cog_row = ec_data[ec_data['COG_category'] == cog]
frac = cog_row['fraction'].values[0] if len(cog_row) > 0 else 0.0
fractions.append(frac)
max_diff = max(fractions) - min(fractions)
test_results.append({
'species': sp,
'short_name': sp.split('--')[0].replace('s__', ''),
'COG_category': cog,
'COG_name': COG_NAMES.get(cog, 'Unknown'),
'p_value': p_val,
'test_type': test_type,
'max_frac_diff': max_diff,
'n_ecotypes': len(ecotype_ids),
'is_adaptive': cog in ADAPTIVE,
'is_housekeeping': cog in HOUSEKEEPING,
})
results_df = pd.DataFrame(test_results)
print(f"Total tests: {len(results_df)}")
print(f"Species tested: {results_df['species'].nunique()}")
print(f"COG categories tested: {results_df['COG_category'].nunique()}")
Total tests: 257 Species tested: 12 COG categories tested: 23
4. Multiple testing correction (BH-FDR)¶
In [5]:
if len(results_df) > 0:
# BH-FDR correction
reject, q_values, _, _ = multipletests(results_df['p_value'], method='fdr_bh', alpha=0.05)
results_df['q_value'] = q_values
results_df['significant'] = reject
n_sig = results_df['significant'].sum()
n_total = len(results_df)
print(f"Significant tests (FDR < 0.05): {n_sig}/{n_total} ({100*n_sig/n_total:.1f}%)")
print(f"Species with ≥1 significant COG difference: "
f"{results_df[results_df['significant']]['species'].nunique()}")
# Save full results
results_df.to_csv('../data/cog_differentiation_tests.csv', index=False)
print(f"Saved to data/cog_differentiation_tests.csv")
else:
print("No test results")
Significant tests (FDR < 0.05): 170/257 (66.1%) Species with ≥1 significant COG difference: 12 Saved to data/cog_differentiation_tests.csv
5. Which COG categories are most frequently differentiated?¶
In [6]:
if len(results_df) > 0 and results_df['significant'].any():
# Fraction of species showing significant differentiation per COG category
cog_summary = results_df.groupby('COG_category').agg(
n_tested=('species', 'nunique'),
n_significant=('significant', 'sum'),
mean_effect=('max_frac_diff', 'mean'),
mean_effect_sig=('max_frac_diff', lambda x: x[results_df.loc[x.index, 'significant']].mean()
if results_df.loc[x.index, 'significant'].any() else 0),
).reset_index()
cog_summary['frac_significant'] = cog_summary['n_significant'] / cog_summary['n_tested']
cog_summary['COG_name'] = cog_summary['COG_category'].map(COG_NAMES)
cog_summary['category_type'] = cog_summary['COG_category'].apply(
lambda x: 'Adaptive' if x in ADAPTIVE else ('Housekeeping' if x in HOUSEKEEPING else 'Other'))
cog_summary = cog_summary.sort_values('frac_significant', ascending=False)
print("COG differentiation frequency (fraction of species where ecotypes differ):")
print(f"{'COG':>4} {'Name':<30} {'Sig/Tested':>12} {'Frac':>8} {'Type':>14} {'Mean Δ':>8}")
print("-" * 85)
for _, row in cog_summary.iterrows():
print(f"{row['COG_category']:>4} {row['COG_name']:<30} "
f"{int(row['n_significant']):>4}/{int(row['n_tested']):<6} "
f"{row['frac_significant']:>7.3f} "
f"{row['category_type']:>14} "
f"{row['mean_effect']:>7.4f}")
else:
print("No significant results to summarize")
COG differentiation frequency (fraction of species where ecotypes differ): COG Name Sig/Tested Frac Type Mean Δ ------------------------------------------------------------------------------------- E Amino acid metabolism 11/12 0.917 Adaptive 0.0120 S Unknown function 11/12 0.917 Other 0.0392 V Defense 11/12 0.917 Adaptive 0.0078 G Carbohydrate metabolism 10/12 0.833 Adaptive 0.0176 M Cell wall 9/12 0.750 Adaptive 0.0155 L Replication/repair 9/12 0.750 Other 0.0337 C Energy production 9/12 0.750 Housekeeping 0.0095 P Inorganic ion transport 9/12 0.750 Adaptive 0.0170 Q Secondary metabolites 9/12 0.750 Adaptive 0.0112 F Nucleotide metabolism 8/12 0.667 Housekeeping 0.0041 D Cell cycle 8/12 0.667 Other 0.0058 U Secretion 8/12 0.667 Other 0.0106 K Transcription 8/12 0.667 Adaptive 0.0141 H Coenzyme metabolism 8/12 0.667 Housekeeping 0.0069 J Translation 8/12 0.667 Housekeeping 0.0051 T Signal transduction 7/12 0.583 Other 0.0094 O PTM/chaperones 7/12 0.583 Other 0.0056 I Lipid metabolism 7/12 0.583 Other 0.0135 N Cell motility 6/12 0.500 Other 0.0093 W Extracellular 3/6 0.500 Other 0.0003 Z Cytoskeleton 2/6 0.333 Other 0.0005 B Chromatin 1/8 0.125 Other 0.0001 A RNA processing 1/9 0.111 Other 0.0002
6. Test H1: Adaptive vs Housekeeping differentiation rates¶
In [7]:
if len(results_df) > 0 and results_df['significant'].any():
# Compare differentiation rates between adaptive and housekeeping categories
adaptive_results = results_df[results_df['is_adaptive']]
housekeeping_results = results_df[results_df['is_housekeeping']]
adaptive_rate = adaptive_results['significant'].mean() if len(adaptive_results) > 0 else 0
housekeeping_rate = housekeeping_results['significant'].mean() if len(housekeeping_results) > 0 else 0
print(f"Differentiation rates:")
print(f" Adaptive (V,P,G,E,Q,M,K): {adaptive_rate:.3f} ({adaptive_results['significant'].sum()}/{len(adaptive_results)})")
print(f" Housekeeping (J,F,H,C): {housekeeping_rate:.3f} ({housekeeping_results['significant'].sum()}/{len(housekeeping_results)})")
# Mann-Whitney U test on per-species effect sizes
adaptive_effects = adaptive_results['max_frac_diff'].values
housekeeping_effects = housekeeping_results['max_frac_diff'].values
if len(adaptive_effects) > 0 and len(housekeeping_effects) > 0:
stat, p_mwu = mannwhitneyu(adaptive_effects, housekeeping_effects, alternative='greater')
print(f"\nMann-Whitney U test (adaptive > housekeeping effect sizes):")
print(f" U = {stat:.0f}, p = {p_mwu:.4e}")
print(f" Mean effect size - Adaptive: {adaptive_effects.mean():.4f}, Housekeeping: {housekeeping_effects.mean():.4f}")
if p_mwu < 0.05:
print(f"\n→ H1 SUPPORTED: Adaptive categories show significantly larger effect sizes")
else:
print(f"\n→ H0 NOT REJECTED: No significant difference between adaptive and housekeeping")
# Also compare per-category differentiation rates
print(f"\nPer-category differentiation rates:")
for cat_type, cats in [('Adaptive', ADAPTIVE), ('Housekeeping', HOUSEKEEPING)]:
for cat in sorted(cats):
cat_data = results_df[results_df['COG_category'] == cat]
if len(cat_data) > 0:
rate = cat_data['significant'].mean()
mean_eff = cat_data['max_frac_diff'].mean()
print(f" {cat} ({COG_NAMES.get(cat, '?'):.<30s}) "
f"rate={rate:.3f} effect={mean_eff:.4f} ({cat_type})")
else:
print("Insufficient data for hypothesis test")
Differentiation rates: Adaptive (V,P,G,E,Q,M,K): 0.798 (67/84) Housekeeping (J,F,H,C): 0.688 (33/48) Mann-Whitney U test (adaptive > housekeeping effect sizes): U = 2981, p = 2.5269e-06 Mean effect size - Adaptive: 0.0136, Housekeeping: 0.0064 → H1 SUPPORTED: Adaptive categories show significantly larger effect sizes Per-category differentiation rates: E (Amino acid metabolism.........) rate=0.917 effect=0.0120 (Adaptive) G (Carbohydrate metabolism.......) rate=0.833 effect=0.0176 (Adaptive) K (Transcription.................) rate=0.667 effect=0.0141 (Adaptive) M (Cell wall.....................) rate=0.750 effect=0.0155 (Adaptive) P (Inorganic ion transport.......) rate=0.750 effect=0.0170 (Adaptive) Q (Secondary metabolites.........) rate=0.750 effect=0.0112 (Adaptive) V (Defense.......................) rate=0.917 effect=0.0078 (Adaptive) C (Energy production.............) rate=0.750 effect=0.0095 (Housekeeping) F (Nucleotide metabolism.........) rate=0.667 effect=0.0041 (Housekeeping) H (Coenzyme metabolism...........) rate=0.667 effect=0.0069 (Housekeeping) J (Translation...................) rate=0.667 effect=0.0051 (Housekeeping)
7. Visualizations¶
In [8]:
if len(results_df) > 0 and results_df['significant'].any():
# Figure 1: COG differentiation rates bar chart
fig, ax = plt.subplots(figsize=(12, 6))
cog_order = cog_summary.sort_values('frac_significant', ascending=True)
colors = []
for _, row in cog_order.iterrows():
if row['COG_category'] in ADAPTIVE:
colors.append('#d62728') # red for adaptive
elif row['COG_category'] in HOUSEKEEPING:
colors.append('#1f77b4') # blue for housekeeping
else:
colors.append('#7f7f7f') # gray for other
bars = ax.barh(
[f"{row['COG_category']} - {row['COG_name']}" for _, row in cog_order.iterrows()],
cog_order['frac_significant'],
color=colors
)
ax.set_xlabel('Fraction of species with significant\necotype differentiation (FDR < 0.05)')
ax.set_title('COG Category Differentiation Between Ecotypes')
# Legend
from matplotlib.patches import Patch
legend_elements = [
Patch(facecolor='#d62728', label='Adaptive (H1: expected to differ)'),
Patch(facecolor='#1f77b4', label='Housekeeping (H1: expected shared)'),
Patch(facecolor='#7f7f7f', label='Other'),
]
ax.legend(handles=legend_elements, loc='lower right')
plt.tight_layout()
plt.savefig('../figures/cog_differentiation_rates.png', dpi=150, bbox_inches='tight')
plt.close()
print("Saved figures/cog_differentiation_rates.png")
# Figure 2: Effect size distributions (adaptive vs housekeeping)
fig, ax = plt.subplots(figsize=(8, 5))
adaptive_effects = results_df[results_df['is_adaptive']]['max_frac_diff']
housekeeping_effects = results_df[results_df['is_housekeeping']]['max_frac_diff']
other_effects = results_df[~results_df['is_adaptive'] & ~results_df['is_housekeeping']]['max_frac_diff']
ax.hist(adaptive_effects, bins=30, alpha=0.6, label=f'Adaptive (n={len(adaptive_effects)})', color='#d62728')
ax.hist(housekeeping_effects, bins=30, alpha=0.6, label=f'Housekeeping (n={len(housekeeping_effects)})', color='#1f77b4')
ax.hist(other_effects, bins=30, alpha=0.3, label=f'Other (n={len(other_effects)})', color='#7f7f7f')
ax.set_xlabel('Max fraction difference between ecotypes')
ax.set_ylabel('Count')
ax.set_title('Effect Size Distribution: Adaptive vs Housekeeping COGs')
ax.legend()
plt.tight_layout()
plt.savefig('../figures/effect_size_distributions.png', dpi=150, bbox_inches='tight')
plt.close()
print("Saved figures/effect_size_distributions.png")
# Figure 3: Heatmap of significant COG differentiation across species
sig_results = results_df[results_df['significant']]
if len(sig_results) > 5:
# Get top 20 species with most significant COG differences
top_species = (sig_results.groupby('short_name')['significant']
.sum().sort_values(ascending=False).head(20).index)
# All COG categories
cog_cats = sorted(results_df['COG_category'].unique())
# Build matrix: -log10(q_value) for significant results
matrix = np.zeros((len(top_species), len(cog_cats)))
for i, sp in enumerate(top_species):
sp_results = results_df[results_df['short_name'] == sp]
for j, cog in enumerate(cog_cats):
cog_result = sp_results[sp_results['COG_category'] == cog]
if len(cog_result) > 0 and cog_result['significant'].values[0]:
matrix[i, j] = -np.log10(max(cog_result['q_value'].values[0], 1e-10))
fig, ax = plt.subplots(figsize=(14, 8))
im = ax.imshow(matrix, aspect='auto', cmap='YlOrRd')
ax.set_xticks(range(len(cog_cats)))
ax.set_xticklabels(cog_cats, fontsize=9)
ax.set_yticks(range(len(top_species)))
ax.set_yticklabels(top_species, fontsize=8)
ax.set_xlabel('COG Category')
ax.set_ylabel('Species')
ax.set_title('COG Differentiation Significance (-log10 FDR q-value)\nTop 20 species')
plt.colorbar(im, ax=ax, label='-log10(q-value)')
# Mark adaptive and housekeeping categories
for j, cog in enumerate(cog_cats):
if cog in ADAPTIVE:
ax.get_xticklabels()[j].set_color('#d62728')
ax.get_xticklabels()[j].set_fontweight('bold')
elif cog in HOUSEKEEPING:
ax.get_xticklabels()[j].set_color('#1f77b4')
ax.get_xticklabels()[j].set_fontweight('bold')
plt.tight_layout()
plt.savefig('../figures/cog_differentiation_heatmap.png', dpi=150, bbox_inches='tight')
plt.close()
print("Saved figures/cog_differentiation_heatmap.png")
else:
print("Too few significant results for heatmap")
else:
print("No significant results for visualization")
Saved figures/cog_differentiation_rates.png Saved figures/effect_size_distributions.png
Saved figures/cog_differentiation_heatmap.png
8. Summary¶
In [9]:
print("=== NB03 Summary ===")
if len(results_df) > 0:
n_species = results_df['species'].nunique()
n_tests = len(results_df)
n_sig = results_df['significant'].sum()
print(f"Species analyzed: {n_species}")
print(f"Total tests (species × COG): {n_tests}")
print(f"Significant (FDR < 0.05): {n_sig} ({100*n_sig/n_tests:.1f}%)")
print(f"Species with ≥1 significant: {results_df[results_df['significant']]['species'].nunique()}")
if results_df['significant'].any():
adaptive_rate = results_df[results_df['is_adaptive']]['significant'].mean()
housekeeping_rate = results_df[results_df['is_housekeeping']]['significant'].mean()
print(f"\nAdaptive differentiation rate: {adaptive_rate:.3f}")
print(f"Housekeeping differentiation rate: {housekeeping_rate:.3f}")
if adaptive_rate > housekeeping_rate:
print(f"Ratio (adaptive/housekeeping): {adaptive_rate/housekeeping_rate:.2f}x" if housekeeping_rate > 0 else "")
print(f"\nTop 5 most differentiated COG categories:")
for _, row in cog_summary.head(5).iterrows():
cat_type = 'A' if row['COG_category'] in ADAPTIVE else ('H' if row['COG_category'] in HOUSEKEEPING else ' ')
print(f" [{cat_type}] {row['COG_category']} {row['COG_name']}: "
f"{row['frac_significant']:.3f} ({int(row['n_significant'])}/{int(row['n_tested'])} species)")
else:
print("No results")
=== NB03 Summary === Species analyzed: 12 Total tests (species × COG): 257 Significant (FDR < 0.05): 170 (66.1%) Species with ≥1 significant: 12 Adaptive differentiation rate: 0.798 Housekeeping differentiation rate: 0.688 Ratio (adaptive/housekeeping): 1.16x Top 5 most differentiated COG categories: [A] E Amino acid metabolism: 0.917 (11/12 species) [ ] S Unknown function: 0.917 (11/12 species) [A] V Defense: 0.917 (11/12 species) [A] G Carbohydrate metabolism: 0.833 (10/12 species) [A] M Cell wall: 0.750 (9/12 species)