Nb07 Metabolic Capability
Jupyter notebook from the Lignin Enrichment and Ecological Memory in Microbial Communities project.
NB07: Metabolic Capability of Lignin-Enriched Taxa¶
Cross-reference lignin-enriched genera from amplicon analysis (NB02-NB06) against the ENIGMA Genome Depot in BERDL to assess genomic potential for lignin degradation.
- Inputs: genus-level composition (NB02), DA results (NB05), BERDL
enigma_genome_depot_enigma - Outputs: metabolic capability matrix, heatmap, bubble plot, consortium model figure
- Key question: Do the enriched genera carry complementary lignin-degradation pathways, suggesting a functional consortium rather than redundant generalists?
In [1]:
import os, pandas as pd, numpy as np
import matplotlib; matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.colors import LinearSegmentedColormap
from pathlib import Path
# Ensure we run from the project root (same as other notebooks)
PROJECT_DIR = Path(__file__).resolve().parent.parent if '__file__' in dir() else Path.cwd()
if PROJECT_DIR.name == 'notebooks':
PROJECT_DIR = PROJECT_DIR.parent
os.chdir(PROJECT_DIR)
from berdl_notebook_utils.setup_spark_session import get_spark_session
spark = get_spark_session()
meta = pd.read_csv('data/sample_metadata.tsv', sep='\t').set_index('sample_id')
tax_16s = pd.read_csv('data/16S/taxonomy.tsv', sep='\t', index_col=0)
otu_rare = pd.read_csv('data/16S/otu_table_rarefied.tsv', sep='\t', index_col=0)
da_base_l = pd.read_csv('data/16S/DA_Base_vs_L.tsv', sep='\t')
group_order = ['Base', 'L', 'LC', 'L-L', 'LC-L', 'L-LC', 'LC-LC']
TARGET_GENERA = ['Pseudomonas', 'Acinetobacter', 'Comamonas',
'Flavobacterium', 'Sphingomonas', 'Rhodococcus', 'Aeromonas']
# Genus-level relative abundance per group
ra = otu_rare.div(otu_rare.sum(axis=1), axis=0) * 100
otu_genus = {otu: (tax_16s.loc[otu, 'Genus'] if otu in tax_16s.index
and pd.notna(tax_16s.loc[otu, 'Genus']) else 'Unclassified')
for otu in ra.columns}
ra.columns = [otu_genus.get(c, 'Unclassified') for c in ra.columns]
genus_ra = ra.T.groupby(ra.columns).sum().T
group_means = {}
for g in group_order:
samps = [s for s in genus_ra.index if meta.loc[s, 'group_label'] == g]
if samps:
group_means[g] = genus_ra.loc[samps].mean()
genus_by_group = pd.DataFrame(group_means).T
print('Genus-level relative abundance (%) for target genera:')
display_cols = [g for g in TARGET_GENERA if g in genus_by_group.columns]
print(genus_by_group[display_cols].round(1).to_string())
Genus-level relative abundance (%) for target genera:
Pseudomonas Acinetobacter Comamonas Flavobacterium Sphingomonas Rhodococcus Aeromonas
Base 0.0 0.0 0.1 0.0 1.6 0.1 0.0
L 34.6 22.0 1.5 12.2 0.1 0.6 0.1
LC 22.2 40.1 0.7 1.3 0.0 0.0 19.2
L-L 46.5 0.4 4.0 5.0 0.8 0.2 0.0
LC-L 21.7 27.8 4.6 2.6 0.1 0.4 5.2
L-LC 46.7 17.9 3.2 2.8 0.1 0.1 1.4
LC-LC 24.4 32.5 4.1 1.4 0.0 0.1 6.2
BERDL Queries: Lignin-Relevant Functional Annotations¶
In [2]:
# Lignin-relevant KEGG orthologs organized by functional category
LIGNIN_KOS = [
'K15733', # DyP peroxidase
'K00422', # polyphenol oxidase
'K05810', # yfiH polyphenol oxidase
'K15063', # ligW 5-carboxyvanillate decarboxylase
'K15060', # ligX dehydrodivanillate O-demethylase
'K15066', # ligM vanillate O-demethylase
'K15064', # desA syringate O-demethylase
'K03381', # catA catechol 1,2-dioxygenase
'K00448', # pcaG protocatechuate 3,4-dioxygenase alpha
'K00449', # pcaH protocatechuate 3,4-dioxygenase beta
'K04100', # ligA protocatechuate 4,5-dioxygenase alpha
'K04101', # ligB protocatechuate 4,5-dioxygenase beta
'K07104', # catE catechol 2,3-dioxygenase
'K00446', # dmpB/xylE catechol 2,3-dioxygenase
'K03862', # vanA vanillate monooxygenase
'K03863', # vanB vanillate monooxygenase ferredoxin
'K04110', # badA benzoate-CoA ligase
'K05549', # benA benzoate 1,2-dioxygenase alpha
'K05550', # benB benzoate 1,2-dioxygenase beta
'K00481', # pobA p-hydroxybenzoate 3-monooxygenase
'K16242', # dmpN phenol hydroxylase P3
'K16243', # dmpL phenol hydroxylase P1
'K03380', # phenol 2-monooxygenase
'K00450', # gentisate 1,2-dioxygenase
'K14333', # DHBD 2,3-dihydroxybenzoate decarboxylase
'K05548', # benK benzoate transporter
'K21758', # benP benzoate porin
'K08195', # pcaK 4-hydroxybenzoate transporter
'K03782', # katG catalase-peroxidase
'K19065', # mobA 3-hydroxybenzoate 4-monooxygenase
'K22270', # nagX 3-hydroxybenzoate 6-monooxygenase
'K13727', # pdc phenolic acid decarboxylase
]
ko_list = ','.join([f"'{k}'" for k in LIGNIN_KOS])
genus_likes = ' OR '.join([f"t.name LIKE '{g}%'" for g in TARGET_GENERA])
genus_case = ' '.join([f"WHEN t.name LIKE '{g}%' THEN '{g}'" for g in TARGET_GENERA])
kegg_query = f"""
SELECT
CASE {genus_case} ELSE t.name END as genus,
ko.kegg_id, ko.description,
COUNT(DISTINCT g.id) as n_genomes,
COUNT(DISTINCT p.id) as n_proteins
FROM enigma_genome_depot_enigma.browser_taxon t
JOIN enigma_genome_depot_enigma.browser_genome g ON g.taxon_id = t.id
JOIN enigma_genome_depot_enigma.browser_gene gn ON gn.genome_id = g.id
JOIN enigma_genome_depot_enigma.browser_protein p ON gn.protein_id = p.id
JOIN enigma_genome_depot_enigma.browser_protein_kegg_orthologs pko ON pko.protein_id = p.id
JOIN enigma_genome_depot_enigma.browser_kegg_ortholog ko ON pko.kegg_ortholog_id = ko.id
WHERE ({genus_likes})
AND ko.kegg_id IN ({ko_list})
GROUP BY 1, ko.kegg_id, ko.description
ORDER BY genus, ko.kegg_id
"""
kegg_hits = spark.sql(kegg_query).toPandas()
print(f'KEGG hits: {len(kegg_hits)} genus-KO combinations')
# Total genomes per genus for prevalence
totals_query = f"""
SELECT
CASE {genus_case} ELSE t.name END as genus,
COUNT(DISTINCT g.id) as total_genomes
FROM enigma_genome_depot_enigma.browser_taxon t
JOIN enigma_genome_depot_enigma.browser_genome g ON g.taxon_id = t.id
WHERE ({genus_likes})
GROUP BY 1
"""
genus_totals = spark.sql(totals_query).toPandas().set_index('genus')['total_genomes']
print(f'\nGenome counts: {dict(genus_totals)}')
kegg_hits.to_csv('data/16S/lignin_kegg_hits.tsv', sep='\t', index=False)
print(kegg_hits.head(10).to_string(index=False))
KEGG hits: 111 genus-KO combinations
Genome counts: {'Sphingomonas': np.int64(38), 'Comamonas': np.int64(23), 'Aeromonas': np.int64(1), 'Rhodococcus': np.int64(21), 'Acinetobacter': np.int64(12), 'Flavobacterium': np.int64(24), 'Pseudomonas': np.int64(508)}
genus kegg_id description n_genomes n_proteins
Acinetobacter K00448 pcaG; protocatechuate 3,4-dioxygenase, alpha subunit [EC:1.13.11.3] 11 7
Acinetobacter K00449 pcaH; protocatechuate 3,4-dioxygenase, beta subunit [EC:1.13.11.3] 11 5
Acinetobacter K00450 E1.13.11.4; gentisate 1,2-dioxygenase [EC:1.13.11.4] 1 1
Acinetobacter K00481 pobA; p-hydroxybenzoate 3-monooxygenase [EC:1.14.13.2] 11 7
Acinetobacter K03381 catA; catechol 1,2-dioxygenase [EC:1.13.11.1] 11 11
Acinetobacter K03782 katG; catalase-peroxidase [EC:1.11.1.21] 7 4
Acinetobacter K03862 vanA; vanillate monooxygenase [EC:1.14.13.82] 11 7
Acinetobacter K03863 vanB; vanillate monooxygenase ferredoxin subunit 11 20
Acinetobacter K04100 ligA; protocatechuate 4,5-dioxygenase, alpha chain [EC:1.13.11.8] 4 1
Acinetobacter K04101 ligB; protocatechuate 4,5-dioxygenase, beta chain [EC:1.13.11.8] 4 1
In [3]:
# CAZy families relevant to lignocellulose
LIGNIN_CAZY = ['AA1','AA2','AA3','AA4','AA5','AA6','AA7','AA10',
'CE1','GH3','GH5','GH8','GH43','CBM1','CBM6']
cazy_list = ','.join([f"'{c}'" for c in LIGNIN_CAZY])
cazy_query = f"""
SELECT
CASE {genus_case} ELSE t.name END as genus,
cf.cazy_id, cf.description,
COUNT(DISTINCT g.id) as n_genomes,
COUNT(DISTINCT p.id) as n_proteins
FROM enigma_genome_depot_enigma.browser_taxon t
JOIN enigma_genome_depot_enigma.browser_genome g ON g.taxon_id = t.id
JOIN enigma_genome_depot_enigma.browser_gene gn ON gn.genome_id = g.id
JOIN enigma_genome_depot_enigma.browser_protein p ON gn.protein_id = p.id
JOIN enigma_genome_depot_enigma.browser_protein_cazy_families pcf ON pcf.protein_id = p.id
JOIN enigma_genome_depot_enigma.browser_cazy_family cf ON pcf.cazy_family_id = cf.id
WHERE ({genus_likes})
AND cf.cazy_id IN ({cazy_list})
GROUP BY 1, cf.cazy_id, cf.description
ORDER BY genus, cf.cazy_id
"""
cazy_hits = spark.sql(cazy_query).toPandas()
print(f'CAZy hits: {len(cazy_hits)} genus-family combinations')
cazy_hits.to_csv('data/16S/lignin_cazy_hits.tsv', sep='\t', index=False)
print(cazy_hits[['genus','cazy_id','n_genomes','n_proteins']].to_string(index=False))
CAZy hits: 29 genus-family combinations
genus cazy_id n_genomes n_proteins
Acinetobacter CE1 11 9
Aeromonas AA10 1 1
Aeromonas CE1 1 1
Aeromonas GH3 1 1
Aeromonas GH8 1 1
Comamonas CE1 23 7
Comamonas GH5 23 12
Comamonas GH8 11 1
Flavobacterium CBM6 4 2
Flavobacterium CE1 1 1
Flavobacterium GH3 22 92
Flavobacterium GH43 20 37
Flavobacterium GH5 22 53
Flavobacterium GH8 13 7
Pseudomonas AA10 393 120
Pseudomonas CE1 502 134
Pseudomonas GH3 493 209
Pseudomonas GH43 41 18
Pseudomonas GH5 163 56
Pseudomonas GH8 153 50
Rhodococcus GH3 19 17
Rhodococcus GH43 1 1
Rhodococcus GH5 17 37
Sphingomonas CBM6 1 1
Sphingomonas CE1 38 33
Sphingomonas GH3 38 153
Sphingomonas GH43 29 52
Sphingomonas GH5 21 13
Sphingomonas GH8 5 4
Metabolic Capability Matrix¶
In [4]:
# Organize KOs into functional categories
PATHWAY_CATEGORIES = {
'DyP peroxidase': ['K15733'],
'Polyphenol oxidase': ['K00422', 'K05810'],
'Lignin dimer processing (ligWX)': ['K15063', 'K15060'],
'Catechol ortho-cleavage (catA)': ['K03381'],
'Protocatechuate ortho (pcaGH)': ['K00448', 'K00449'],
'Protocatechuate meta (ligAB)': ['K04100', 'K04101'],
'Catechol meta-cleavage': ['K07104', 'K00446'],
'Vanillate catabolism (vanAB)': ['K03862', 'K03863'],
'Benzoate dioxygenase (benAB)': ['K05549', 'K05550'],
'Benzoate-CoA ligase (badA)': ['K04110'],
'p-HBA monooxygenase (pobA)': ['K00481'],
'Phenol hydroxylase': ['K16242', 'K16243', 'K03380'],
'Gentisate dioxygenase': ['K00450'],
'Aromatic transport (benK/pcaK)': ['K05548', 'K08195'],
}
CAZY_CATEGORIES = {
'LPMO (AA10)': ['AA10'],
'Feruloyl esterase (CE1)': ['CE1'],
'Beta-glucosidase (GH3)': ['GH3'],
'Cellulase (GH5)': ['GH5'],
'Xylanase (GH8/GH43)': ['GH8', 'GH43'],
}
genera_order = ['Pseudomonas', 'Acinetobacter', 'Flavobacterium',
'Comamonas', 'Sphingomonas', 'Rhodococcus']
def compute_prevalence(hits_df, gene_ids, id_col, genus, total):
"""Max prevalence across gene IDs for a genus."""
subset = hits_df[(hits_df['genus'] == genus) & (hits_df[id_col].isin(gene_ids))]
if subset.empty:
return 0.0, 0
return subset['n_genomes'].max() / total * 100, int(subset['n_proteins'].sum())
rows_prev = []
rows_copies = []
all_categories = list(PATHWAY_CATEGORIES.keys()) + list(CAZY_CATEGORIES.keys())
for cat in all_categories:
prev_row = {'category': cat}
copy_row = {'category': cat}
if cat in PATHWAY_CATEGORIES:
ids = PATHWAY_CATEGORIES[cat]
hits_df, id_col = kegg_hits, 'kegg_id'
else:
ids = CAZY_CATEGORIES[cat]
hits_df, id_col = cazy_hits, 'cazy_id'
for genus in genera_order:
total = genus_totals.get(genus, 1)
prev, copies = compute_prevalence(hits_df, ids, id_col, genus, total)
prev_row[genus] = round(prev, 1)
copy_row[genus] = copies
rows_prev.append(prev_row)
rows_copies.append(copy_row)
cap_matrix = pd.DataFrame(rows_prev).set_index('category')
copy_matrix = pd.DataFrame(rows_copies).set_index('category')
cap_matrix.to_csv('data/16S/metabolic_capability_matrix.tsv', sep='\t')
print('Metabolic capability matrix (% genomes with gene):')
print(cap_matrix.to_string())
Metabolic capability matrix (% genomes with gene):
Pseudomonas Acinetobacter Flavobacterium Comamonas Sphingomonas Rhodococcus
category
DyP peroxidase 0.0 0.0 0.0 0.0 0.0 85.7
Polyphenol oxidase 98.8 91.7 45.8 100.0 100.0 90.5
Lignin dimer processing (ligWX) 0.0 0.0 0.0 0.0 15.8 0.0
Catechol ortho-cleavage (catA) 85.0 91.7 4.2 52.2 2.6 85.7
Protocatechuate ortho (pcaGH) 97.4 91.7 8.3 4.3 18.4 90.5
Protocatechuate meta (ligAB) 9.1 33.3 0.0 100.0 13.2 0.0
Catechol meta-cleavage 0.6 0.0 4.2 47.8 2.6 9.5
Vanillate catabolism (vanAB) 36.0 91.7 0.0 100.0 10.5 4.8
Benzoate dioxygenase (benAB) 69.3 83.3 0.0 0.0 2.6 61.9
Benzoate-CoA ligase (badA) 0.0 0.0 0.0 100.0 21.1 0.0
p-HBA monooxygenase (pobA) 97.6 91.7 0.0 100.0 21.1 76.2
Phenol hydroxylase 1.6 33.3 4.2 47.8 0.0 0.0
Gentisate dioxygenase 9.4 8.3 29.2 100.0 47.4 28.6
Aromatic transport (benK/pcaK) 97.2 91.7 0.0 100.0 21.1 71.4
LPMO (AA10) 77.4 0.0 0.0 0.0 0.0 0.0
Feruloyl esterase (CE1) 98.8 91.7 4.2 100.0 100.0 0.0
Beta-glucosidase (GH3) 97.0 0.0 91.7 0.0 100.0 90.5
Cellulase (GH5) 32.1 0.0 91.7 100.0 55.3 81.0
Xylanase (GH8/GH43) 30.1 0.0 83.3 47.8 76.3 4.8
Metabolic Capability Heatmap¶
In [5]:
fig, ax = plt.subplots(figsize=(10, 10))
cmap = LinearSegmentedColormap.from_list('cap', ['#f7f7f7', '#d4e6f1', '#2980b9', '#1a5276'])
data = cap_matrix.values.astype(float)
im = ax.imshow(data, cmap=cmap, aspect='auto', vmin=0, vmax=100)
ax.set_xticks(range(len(genera_order)))
ax.set_xticklabels([f'{g}\n(n={genus_totals.get(g,"?")})' for g in genera_order],
rotation=45, ha='right', fontsize=10)
ax.set_yticks(range(len(all_categories)))
ax.set_yticklabels(all_categories, fontsize=9)
for i in range(data.shape[0]):
for j in range(data.shape[1]):
val = data[i, j]
color = 'white' if val > 60 else 'black'
text = f'{val:.0f}' if val > 0 else '-'
ax.text(j, i, text, ha='center', va='center', fontsize=8, color=color)
# Divider between KEGG and CAZy sections
n_kegg = len(PATHWAY_CATEGORIES)
ax.axhline(y=n_kegg - 0.5, color='black', linewidth=1.5, linestyle='--')
ax.text(-0.7, n_kegg / 2 - 0.5, 'KEGG\northologs', ha='right', va='center',
fontsize=9, fontweight='bold', rotation=90)
ax.text(-0.7, n_kegg + len(CAZY_CATEGORIES) / 2 - 0.5, 'CAZy\nfamilies',
ha='right', va='center', fontsize=9, fontweight='bold', rotation=90)
cbar = plt.colorbar(im, ax=ax, shrink=0.6, pad=0.02)
cbar.set_label('Gene prevalence (% of genomes)', fontsize=10)
ax.set_title('Lignin-Degradation Gene Repertoire of Enriched Genera\n'
'(ENIGMA Genome Depot, BERDL)', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('figures/metabolic_capability_heatmap.png', dpi=150, bbox_inches='tight')
plt.close()
print('Saved figures/metabolic_capability_heatmap.png')
Saved figures/metabolic_capability_heatmap.png
In [6]:
# Bubble plot: size = gene copies, color = prevalence
fig, ax = plt.subplots(figsize=(11, 10))
prev_vals = cap_matrix.values.astype(float)
copy_vals = copy_matrix.values.astype(float)
for i in range(len(all_categories)):
for j in range(len(genera_order)):
prev = prev_vals[i, j]
copies = copy_vals[i, j]
if copies == 0:
ax.plot(j, i, 'x', color='#cccccc', markersize=5)
continue
size = max(20, min(500, copies * 3))
ax.scatter(j, i, s=size, c=[prev], cmap=cmap, vmin=0, vmax=100,
edgecolors='#333333', linewidth=0.5, zorder=3)
ax.set_xticks(range(len(genera_order)))
ax.set_xticklabels(genera_order, rotation=45, ha='right', fontsize=10)
ax.set_yticks(range(len(all_categories)))
ax.set_yticklabels(all_categories, fontsize=9)
ax.axhline(y=n_kegg - 0.5, color='black', linewidth=1.5, linestyle='--')
ax.set_xlim(-0.5, len(genera_order) - 0.5)
ax.set_ylim(len(all_categories) - 0.5, -0.5)
ax.grid(True, alpha=0.2)
sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(0, 100))
cbar = plt.colorbar(sm, ax=ax, shrink=0.5, pad=0.02)
cbar.set_label('Gene prevalence (%)', fontsize=10)
for sz, label in [(30, '10'), (150, '50'), (450, '150')]:
ax.scatter([], [], s=sz, c='#999999', edgecolors='#333333',
linewidth=0.5, label=f'{label} proteins')
ax.legend(title='Gene copies', loc='lower right', fontsize=8, title_fontsize=9)
ax.set_title('Lignin-Degradation Gene Prevalence and Copy Number\n'
'Bubble size = protein count, color = genome prevalence',
fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('figures/metabolic_pathway_roles.png', dpi=150, bbox_inches='tight')
plt.close()
print('Saved figures/metabolic_pathway_roles.png')
Saved figures/metabolic_pathway_roles.png
Lignin-Degrading Consortium: Functional Model¶
In [7]:
fig, ax = plt.subplots(figsize=(14, 9))
ax.set_xlim(0, 14)
ax.set_ylim(0, 9)
ax.axis('off')
genus_colors = {
'Rhodococcus': '#E74C3C', 'Sphingomonas': '#F39C12',
'Pseudomonas': '#3498DB', 'Acinetobacter': '#2ECC71',
'Comamonas': '#9B59B6', 'Flavobacterium': '#1ABC9C'
}
def draw_box(ax, x, y, w, h, text, color, fontsize=9):
rect = mpatches.FancyBboxPatch((x, y), w, h, boxstyle='round,pad=0.15',
facecolor=color, edgecolor='#333333',
linewidth=1.5, alpha=0.85)
ax.add_patch(rect)
ax.text(x + w/2, y + h/2, text, ha='center', va='center',
fontsize=fontsize, fontweight='bold', wrap=True)
def draw_arrow(ax, x1, y1, x2, y2, label='', color='#555555'):
ax.annotate('', xy=(x2, y2), xytext=(x1, y1),
arrowprops=dict(arrowstyle='->', color=color, lw=2))
if label:
mx, my = (x1+x2)/2, (y1+y2)/2
ax.text(mx, my + 0.15, label, ha='center', va='bottom',
fontsize=7, style='italic', color='#555555')
# Substrate boxes (grey)
draw_box(ax, 0.5, 7.0, 3.0, 1.2, 'LIGNIN\nPOLYMER', '#D5D8DC', fontsize=11)
draw_box(ax, 5.0, 7.0, 3.0, 1.0, 'Lignin-Carbohydrate\nComplexes', '#D5D8DC', fontsize=9)
draw_box(ax, 1.0, 4.5, 2.5, 0.9, 'Dimers &\nOligomers', '#EAECEE', fontsize=9)
draw_box(ax, 1.0, 2.2, 2.5, 0.9, 'Aromatic\nMonomers', '#EAECEE', fontsize=9)
draw_box(ax, 5.5, 4.5, 2.0, 0.9, 'Cellulose &\nHemicellulose', '#EAECEE', fontsize=9)
draw_box(ax, 5.0, 0.3, 3.0, 0.8, 'TCA Cycle Intermediates', '#D5DBDB', fontsize=9)
draw_box(ax, 9.0, 4.5, 2.0, 0.9, 'Sugars', '#EAECEE', fontsize=9)
# Genus boxes with their key enzymes
draw_box(ax, 5.0, 2.0, 2.5, 1.2, 'Pseudomonas\nAcinetobacter\n(catA, pcaGH, benAB)',
genus_colors['Pseudomonas'], fontsize=8)
draw_box(ax, 8.0, 2.0, 2.5, 1.2, 'Comamonas\n(ligAB, badA\nmeta-cleavage)',
genus_colors['Comamonas'], fontsize=8)
draw_box(ax, 10.0, 7.0, 3.0, 1.0, 'Flavobacterium\n(GH3, GH5, GH43)',
genus_colors['Flavobacterium'], fontsize=8)
# Step 1: Lignin -> dimers (Rhodococcus)
ax.annotate('Rhodococcus\nDyP peroxidase', xy=(2.0, 7.0), xytext=(2.0, 6.1),
fontsize=8, fontweight='bold', color=genus_colors['Rhodococcus'],
ha='center', va='top',
arrowprops=dict(arrowstyle='->', color=genus_colors['Rhodococcus'], lw=2.5))
# Step 2: Dimers -> monomers (Sphingomonas)
ax.annotate('Sphingomonas\nligW, ligX', xy=(2.25, 4.5), xytext=(2.25, 3.6),
fontsize=8, fontweight='bold', color=genus_colors['Sphingomonas'],
ha='center', va='top',
arrowprops=dict(arrowstyle='->', color=genus_colors['Sphingomonas'], lw=2.5))
# Step 3: Monomers -> TCA (two parallel paths)
draw_arrow(ax, 3.5, 2.5, 5.0, 2.5, '', '#3498DB') # ortho path
draw_arrow(ax, 3.5, 2.8, 8.0, 2.8, '', '#9B59B6') # meta path
draw_arrow(ax, 6.25, 2.0, 6.25, 1.1, 'ortho-cleavage', '#3498DB')
draw_arrow(ax, 9.25, 2.0, 8.0, 1.1, 'meta-cleavage', '#9B59B6')
# LCC path
draw_arrow(ax, 6.5, 7.0, 6.5, 5.4, '', '#1ABC9C')
draw_arrow(ax, 7.5, 4.5, 9.0, 4.9, '', '#555555')
# Flavobacterium path
draw_arrow(ax, 11.5, 7.0, 11.5, 5.4, '', '#1ABC9C')
ax.set_title('Proposed Functional Roles in the Lignin-Degrading Consortium',
fontsize=14, fontweight='bold', pad=20)
handles = [mpatches.Patch(color=c, label=g) for g, c in genus_colors.items()]
ax.legend(handles=handles, loc='lower left', fontsize=8, framealpha=0.9,
title='Genera', title_fontsize=9)
plt.tight_layout()
plt.savefig('figures/consortium_functional_model.png', dpi=150, bbox_inches='tight')
plt.close()
print('Saved figures/consortium_functional_model.png')
Saved figures/consortium_functional_model.png
Community Metabolic Potential by Treatment Group¶
In [8]:
# Community metabolic potential = sum(genus_RA * genus_prevalence) per pathway per group
# This estimates the community's aggregate genetic potential for each pathway step
community_potential = {}
for group in group_order:
group_pot = {}
for cat in all_categories:
score = 0.0
for genus in genera_order:
if genus in genus_by_group.columns:
ra_val = genus_by_group.loc[group, genus] / 100
prev_val = cap_matrix.loc[cat, genus] / 100
score += ra_val * prev_val
group_pot[cat] = round(score * 100, 2)
community_potential[group] = group_pot
cp_df = pd.DataFrame(community_potential).T
cp_df.index.name = 'Group'
print('Community metabolic potential (RA-weighted prevalence score):')
print(cp_df.to_string())
print()
# Compare L-pathway vs LC-pathway groups
l_groups = ['L', 'L-L', 'L-LC']
lc_groups = ['LC', 'LC-L', 'LC-LC']
l_mean = cp_df.loc[l_groups].mean()
lc_mean = cp_df.loc[lc_groups].mean()
diff = l_mean - lc_mean
comparison = pd.DataFrame({'L-history mean': l_mean.round(2),
'LC-history mean': lc_mean.round(2),
'Difference (L - LC)': diff.round(2)})
print('\nL-history vs LC-history pathway potential:')
print(comparison.to_string())
Community metabolic potential (RA-weighted prevalence score):
DyP peroxidase Polyphenol oxidase Lignin dimer processing (ligWX) Catechol ortho-cleavage (catA) Protocatechuate ortho (pcaGH) Protocatechuate meta (ligAB) Catechol meta-cleavage Vanillate catabolism (vanAB) Benzoate dioxygenase (benAB) Benzoate-CoA ligase (badA) p-HBA monooxygenase (pobA) Phenol hydroxylase Gentisate dioxygenase Aromatic transport (benK/pcaK) LPMO (AA10) Feruloyl esterase (CE1) Beta-glucosidase (GH3) Cellulase (GH5) Xylanase (GH8/GH43)
Group
Base 0.12 1.85 0.25 0.23 0.44 0.32 0.11 0.29 0.14 0.45 0.57 0.06 0.91 0.56 0.01 1.70 1.75 1.14 1.30
L 0.48 62.05 0.02 51.35 55.46 11.99 1.50 34.16 42.64 1.52 55.87 9.11 10.36 55.71 26.76 56.45 45.36 24.33 21.42
LC 0.03 59.99 0.00 56.05 58.52 16.06 0.52 45.45 48.78 0.69 59.12 14.09 6.49 59.03 17.15 59.41 22.72 9.03 8.09
L-L 0.13 53.45 0.12 42.30 46.50 8.42 2.42 21.14 32.66 4.12 49.96 2.98 10.24 49.77 35.97 51.22 50.60 24.04 20.67
LC-L 0.33 53.16 0.02 46.78 47.41 15.84 2.47 37.93 38.44 4.61 51.59 11.91 9.85 51.48 16.79 51.74 23.85 14.28 10.95
L-LC 0.08 67.33 0.02 58.06 62.44 13.48 1.96 36.52 47.39 3.26 65.39 8.38 10.01 65.20 36.17 66.09 48.08 20.93 18.02
LC-LC 0.05 58.68 0.01 52.77 53.89 17.14 2.16 42.69 44.01 4.09 57.73 13.23 9.50 57.63 18.85 58.06 24.95 13.21 10.44
L-history vs LC-history pathway potential:
L-history mean LC-history mean Difference (L - LC)
DyP peroxidase 0.23 0.14 0.09
Polyphenol oxidase 60.94 57.28 3.67
Lignin dimer processing (ligWX) 0.05 0.01 0.04
Catechol ortho-cleavage (catA) 50.57 51.87 -1.30
Protocatechuate ortho (pcaGH) 54.80 53.27 1.53
Protocatechuate meta (ligAB) 11.30 16.35 -5.05
Catechol meta-cleavage 1.96 1.72 0.24
Vanillate catabolism (vanAB) 30.61 42.02 -11.42
Benzoate dioxygenase (benAB) 40.90 43.74 -2.85
Benzoate-CoA ligase (badA) 2.97 3.13 -0.16
p-HBA monooxygenase (pobA) 57.07 56.15 0.93
Phenol hydroxylase 6.82 13.08 -6.25
Gentisate dioxygenase 10.20 8.61 1.59
Aromatic transport (benK/pcaK) 56.89 56.05 0.85
LPMO (AA10) 32.97 17.60 15.37
Feruloyl esterase (CE1) 57.92 56.40 1.52
Beta-glucosidase (GH3) 48.01 23.84 24.17
Cellulase (GH5) 23.10 12.17 10.93
Xylanase (GH8/GH43) 20.04 9.83 10.21
In [9]:
# Summary: one row per genus with enrichment context, key capabilities, and proposed role
summary_data = []
roles = {
'Pseudomonas': ('Dominant in L groups (39-53%)',
'catA, pcaGH, benAB, vanAB, AA10, CE1',
'Generalist aromatic monomer processor (ortho-cleavage)'),
'Acinetobacter': ('Dominant in LC groups (31-44%)',
'catA, pcaGH, benAB, vanAB, phenol hydroxylase',
'Copiotrophic aromatic degrader, overlaps Pseudomonas'),
'Flavobacterium': ('Enriched in L (14%), depleted in LC',
'GH3, GH5, GH43, GH8, polyphenol oxidase',
'Polysaccharide specialist (lignin-carbohydrate complexes)'),
'Comamonas': ('Moderate enrichment (2-5%), stable across conditions',
'ligAB (100%), badA (100%), catechol meta-cleavage',
'Alternative meta-cleavage pathway, anaerobic benzoate'),
'Sphingomonas': ('Low RA in amplicons, known lignin specialist',
'ligW, ligX, ligAB, CE1, GH3, GH43',
'Lignin dimer specialist (unique ligW/ligX)'),
'Rhodococcus': ('Low RA in amplicons, only DyP+ genus',
'DyP peroxidase (86%), catA, pcaGH, benAB',
'Lignin depolymerization initiator (DyP peroxidase)'),
}
for genus in genera_order:
context, genes, role = roles[genus]
summary_data.append({
'Genus': genus,
'Genomes in BERDL': genus_totals.get(genus, 0),
'Enrichment context': context,
'Key lignin genes': genes,
'Proposed consortium role': role
})
summary_df = pd.DataFrame(summary_data)
summary_df.to_csv('data/16S/consortium_roles_summary.tsv', sep='\t', index=False)
print('Lignin-degrading consortium — genus roles:')
print(summary_df.to_string(index=False))
print('\nSaved data/16S/consortium_roles_summary.tsv')
Lignin-degrading consortium — genus roles:
Genus Genomes in BERDL Enrichment context Key lignin genes Proposed consortium role
Pseudomonas 508 Dominant in L groups (39-53%) catA, pcaGH, benAB, vanAB, AA10, CE1 Generalist aromatic monomer processor (ortho-cleavage)
Acinetobacter 12 Dominant in LC groups (31-44%) catA, pcaGH, benAB, vanAB, phenol hydroxylase Copiotrophic aromatic degrader, overlaps Pseudomonas
Flavobacterium 24 Enriched in L (14%), depleted in LC GH3, GH5, GH43, GH8, polyphenol oxidase Polysaccharide specialist (lignin-carbohydrate complexes)
Comamonas 23 Moderate enrichment (2-5%), stable across conditions ligAB (100%), badA (100%), catechol meta-cleavage Alternative meta-cleavage pathway, anaerobic benzoate
Sphingomonas 38 Low RA in amplicons, known lignin specialist ligW, ligX, ligAB, CE1, GH3, GH43 Lignin dimer specialist (unique ligW/ligX)
Rhodococcus 21 Low RA in amplicons, only DyP+ genus DyP peroxidase (86%), catA, pcaGH, benAB Lignin depolymerization initiator (DyP peroxidase)
Saved data/16S/consortium_roles_summary.tsv