04 Enigma Utilizers
Jupyter notebook from the ENIGMA Carbon Census 1 project.
NB04 — ENIGMA Isolate Utilizer Table (Deliverable a)¶
Turn the (compound × genome) reaction-bridge calls from NB03 into the wet-lab-facing deliverable: for each compound, which ENIGMA isolate strains are predicted to catabolize it, with a confidence tier, the specific enzyme step, and a specificity flag to guide strain selection.
What this notebook adds over NB03:
- Strain-level rollup — collapse multiple genome assemblies of the same strain to one row (best-scoring genome), since the wet lab tests strains, not assemblies.
- Enzyme annotation — attach the KEGG reaction description (e.g. salicylate hydroxylase) so each call names the enzyme to assay.
- Specificity flag —
narrow(≤10 strains),focused(≤50),broad(>50). Narrow calls are the most actionable for picking a discriminating test strain; broad calls (central aromatic intermediates) say most isolates can do this. - Explicit Tier-0 rows — the 68 organism-dark compounds are written into the same table as no-call rows, so the deliverable is complete and the gap is visible, not hidden.
import pandas as pd, numpy as np, os
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from dotenv import load_dotenv
DATA = Path('../data'); FIG = Path('../figures')
pd.set_option('display.max_columns', 40); pd.set_option('display.width', 220)
pred = pd.read_csv(DATA / 'compound_organism_predictions.tsv', sep='\t')
dark = pd.read_csv(DATA / 'compound_organism_dark.tsv', sep='\t')
link = pd.read_csv(DATA / 'compound_linkage_deepened.tsv', sep='\t')
print('prediction rows:', len(pred), '| dark compounds:', len(dark))
prediction rows: 948 | dark compounds: 75
Enzyme descriptions for the signature reactions¶
Pull the human-readable reaction name for each signature R-number so the deliverable names the enzyme step, not just an R-number.
load_dotenv('/home/aparkin/BERIL-research-observatory/.env')
_tok = os.environ['KBASE_AUTH_TOKEN']
from pyspark.sql import SparkSession
_url = f'sc://jupyter-aparkin.jupyterhub-prod:15002/;use_ssl=false;x-kbase-token={_tok}'
spark = SparkSession.builder.remote(_url).getOrCreate()
DB = 'enigma_genome_depot_enigma'
sig_set = sorted({r for s in pred['sig_rxns_carried'].dropna() for r in str(s).split(';') if r})
rstr = ','.join(f"'{r}'" for r in sig_set)
rdesc = spark.sql(f"SELECT kegg_id rxn, description FROM {DB}.browser_kegg_reaction "
f"WHERE kegg_id IN ({rstr})").toPandas()
rdesc_map = dict(zip(rdesc['rxn'], rdesc['description']))
print('signature reactions annotated:', len(rdesc_map), '/', len(sig_set))
for r in sig_set[:12]:
print(f' {r}: {rdesc_map.get(r, "(no desc)")[:80]}')
signature reactions annotated: 13 / 13 R01294: 4-hydroxybenzaldehyde:NADP+ oxidoreductase; 4-Hydroxybenzaldehyde + NADP+ + H2O R01628: 3-hydroxybenzoate,NADPH:oxygen oxidoreductase (4-hydroxylating); 3-Hydroxybenzoa R02107: Xanthine:oxygen oxidoreductase; Xanthine + H2O + Oxygen <=> Urate + Hydrogen per R02612: Phenethylamine:(acceptor) oxidoreductase (deaminating); Phenethylamine + H2O + A R02675: 4-methylphenol:oxidized azurin oxidoreductase (methyl-hydroxylating); 4-Cresol + R02941: salicylaldehyde:NAD+ oxidoreductase; Salicylaldehyde + NAD+ + H2O <=> Salicylate R05148: benzene-1,4-dicarboxylate, NADH:oxygen oxidoreductase (1,2-hydroxylating); Terep R05360: 2-Hydroxy-6-oxo-6-(2-hydroxyphenyl)-hexa-2,4-dienoate + H2O <=> Salicylate + 2-H R05375: 4-Hydroxyphthalate carboxy-lyase; 4-Hydroxyphthalate <=> 3-Hydroxybenzoate + CO2 R05643: 2-formylbenzoate:NAD+ oxidoreductase; 2-Carboxybenzaldehyde + NAD+ + H2O <=> Pht R07202: abscisate,[reduced NADPH---hemoprotein reductase]:oxygen oxidoreductase (8'-hydr R07703: 1,2-dihydrophthalic acid:NAD+ oxidoreductase; 1,2-Dihydrophthalic acid + NAD+ <=
Strain-level rollup¶
Group genomes by strain (NCBI taxon + full name); keep the best-scoring genome per strain.
pred['enzymes'] = pred['sig_rxns_carried'].apply(
lambda s: '; '.join(f'{r} ({rdesc_map.get(r, "?")})' for r in str(s).split(';') if r))
# best genome per (compound, strain)
pred = pred.sort_values('score', ascending=False)
strain = (pred.groupby(['compound_id', 'name', 'npc_pathway', 'kegg_id', 'tier',
'ncbi_taxid', 'taxon_name'], dropna=False)
.agg(best_genome=('genome_name', 'first'),
strain_full=('strain_full', 'first'),
n_genomes=('genome_id', 'nunique'),
score=('score', 'max'),
sig_completeness=('sig_completeness', 'max'),
enzymes=('enzymes', 'first'))
.reset_index())
print('strain-level prediction rows:', len(strain))
print('distinct strains predicted :', strain['ncbi_taxid'].nunique())
strain-level prediction rows: 494 distinct strains predicted : 359
Per-compound specificity + breadth¶
br = (strain.groupby(['compound_id', 'name', 'npc_pathway', 'tier'])
.agg(n_strains=('ncbi_taxid', 'nunique'), max_score=('score', 'max'),
max_completeness=('sig_completeness', 'max')).reset_index())
def spec(n):
return 'narrow' if n <= 10 else ('focused' if n <= 50 else 'broad')
br['specificity'] = br['n_strains'].map(spec)
br = br.sort_values('n_strains')
print(br[['name', 'npc_pathway', 'tier', 'n_strains', 'specificity',
'max_score', 'max_completeness']].to_string(index=False))
name npc_pathway tier n_strains specificity max_score max_completeness
Abscisic acid Terpenoids T3_signature 2 narrow 3.192 1.000
xanthine Alkaloids T3_signature 13 focused 2.214 1.000
Phenylethylamine Alkaloids T3_signature 28 focused 1.584 1.000
TEREPHTHALIC ACID Shikimates and Phenylpropanoids T2_pathway 34 focused 1.506 1.000
phthalic acid Shikimates and Phenylpropanoids T2_pathway 36 focused 3.928 0.667
4-hydroxybenzaldehyde Shikimates and Phenylpropanoids T2_pathway 125 broad 3.367 1.000
3-hydroxybenzoic acid Shikimates and Phenylpropanoids T2_pathway 127 broad 2.669 1.000
salicylic acid Shikimates and Phenylpropanoids T2_pathway 129 broad 3.035 1.000
Genus distribution among predicted utilizers¶
Which genera show up most often as candidate utilizers — useful for choosing test strains already in hand.
strain['genus'] = strain['taxon_name'].astype(str).str.split().str[0]
gtop = (strain.groupby('genus')['compound_id'].nunique().rename('n_compounds')
.reset_index().sort_values('n_compounds', ascending=False).head(20))
gcount = strain['genus'].value_counts().rename('n_calls')
gtop = gtop.merge(gcount, left_on='genus', right_index=True)
print(gtop.to_string(index=False))
genus n_compounds n_calls
Unknown 7 7
environmental 7 7
Acidovorax 5 18
Burkholderiaceae 5 5
Cupriavidus 5 8
Paraburkholderia 5 34
Hydrogenophaga 4 7
Burkholderia 4 19
Variovorax 4 30
Polaromonas 4 7
Comamonas 3 5
Collimonas 3 8
bacterium 3 3
Pseudomonas 3 79
Pseudomonadaceae 3 3
Rhodoferax 3 6
Pigmentiphaga 3 3
Myxococcaceae 3 3
Bosea 3 6
Caulobacter 3 4
fig, ax = plt.subplots(figsize=(8, 6))
g = gtop.sort_values('n_calls')
ax.barh(g['genus'], g['n_calls'], color='#41b6c4')
ax.set_xlabel('candidate-utilizer calls (compound x strain)')
ax.set_title('Top genera among predicted ENIGMA isolate utilizers')
fig.tight_layout(); fig.savefig(FIG / '04_utilizer_genera.png', dpi=150)
print('saved 04_utilizer_genera.png'); plt.close(fig)
saved 04_utilizer_genera.png
Assemble the complete deliverable (a) table¶
Predicted calls + explicit Tier-0 rows for the 68 organism-dark compounds.
call_cols = ['compound_id', 'name', 'npc_pathway', 'kegg_id', 'tier',
'ncbi_taxid', 'taxon_name', 'strain_full', 'best_genome',
'n_genomes', 'score', 'sig_completeness', 'enzymes']
calls = strain[call_cols].copy()
calls = calls.merge(br[['compound_id', 'specificity']].drop_duplicates(), on='compound_id', how='left')
dark_rows = dark.rename(columns={'best_tier': 'linkage_tier'}).copy()
dark_rows['tier'] = 'T0_organism_dark'
for c in ['ncbi_taxid', 'taxon_name', 'strain_full', 'best_genome', 'n_genomes',
'score', 'sig_completeness', 'enzymes', 'specificity']:
dark_rows[c] = np.nan
dark_rows['enzymes'] = dark_rows['organism_dark_reason']
dark_rows = dark_rows[call_cols + ['specificity']]
deliverable = pd.concat([calls, dark_rows], ignore_index=True)
deliverable = deliverable.sort_values(['tier', 'compound_id', 'score'],
ascending=[True, True, False]).reset_index(drop=True)
deliverable.to_csv(DATA / 'enigma_utilizer_predictions.tsv', sep='\t', index=False)
print('wrote data/enigma_utilizer_predictions.tsv', deliverable.shape)
print()
print('tier breakdown (rows):')
print(deliverable['tier'].value_counts().to_string())
print()
print('compounds with >=1 isolate call:', calls['compound_id'].nunique(),
'| organism-dark compounds:', dark_rows['compound_id'].nunique())
wrote data/enigma_utilizer_predictions.tsv (569, 14) tier breakdown (rows): tier T2_pathway 451 T0_organism_dark 75 T3_signature 43 compounds with >=1 isolate call: 8 | organism-dark compounds: 75
Spot-check: the actionable narrow calls¶
The compounds where a single strain choice would be discriminating in the wet lab.
narrow_ids = br[br['specificity'] == 'narrow']['compound_id']
nz = calls[calls['compound_id'].isin(narrow_ids)].sort_values(['name', 'score'], ascending=[True, False])
for nm, grp in nz.groupby('name'):
e = grp.iloc[0]['enzymes']
print(f'\n=== {nm} [{grp.iloc[0]["tier"]}] enzyme: {e[:70]} ===')
print(grp[['taxon_name', 'strain_full', 'score', 'sig_completeness']].head(10).to_string(index=False))
=== Abscisic acid [T3_signature] enzyme: R07202 (abscisate,[reduced NADPH---hemoprotein reductase]:oxygen oxido ===
taxon_name strain_full score sig_completeness
Microbacterium Microbacterium sp. SAI-030 3.192 1.0
Myxococcaceae bacterium NaN 3.192 1.0