06 Phylo Maps
Jupyter notebook from the ENIGMA Carbon Census 1 project.
NB06 — Per-compound Phylogenetic Utilizer Map (Deliverable b)¶
Place the predicted ENIGMA-isolate utilizers from NB03/NB04 onto a taxonomic backbone, with a per-prediction certainty score, so the wet lab can see where in the tree each compound's predicted catabolizers sit.
What backbone (honesty note). genome_depot's browser_taxon is the NCBI taxonomy (ranks + parent_id chain), not GTDB with branch lengths. So this deliverable is an NCBI-lineage placement (domain→phylum→class→order→family→genus→species), not a true phylogeny. That is the resolution the in-lakehouse ENIGMA isolate metadata supports; a branch-length GTDB tree would require the pangenome assembly-accession bridge (deferred — adds no resolution for the wet-lab strain-selection use case, which is genus/family-level).
Certainty score. Per (compound, strain): evidence tier (T2 degradation-map > T3 allowlist) combined with sig_completeness (fraction of the compound's catabolic signature reactions the genome carries). Derived label: high = T2 & completeness=1; medium = T2 or completeness=1; low otherwise.
import pandas as pd, numpy as np, os
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from dotenv import load_dotenv
DATA = Path('../data'); FIG = Path('../figures')
pd.set_option('display.max_columns', 40); pd.set_option('display.width', 220)
pred = pd.read_csv(DATA / 'compound_organism_predictions.tsv', sep='\t')
print('prediction rows:', len(pred), '| genomes:', pred['genome_id'].nunique(),
'| compounds:', pred['compound_id'].nunique())
prediction rows: 948 | genomes: 800 | compounds: 8
Build NCBI lineages from the genome_depot taxonomy backbone¶
browser_taxon.parent_id points to the parent's taxonomy_id (NCBI taxid). Pull the whole table (small) and walk each predicted genome's taxid up to the root, collecting the standard ranks.
load_dotenv('/home/aparkin/BERIL-research-observatory/.env')
_tok = os.environ['KBASE_AUTH_TOKEN']
from pyspark.sql import SparkSession
_url = f'sc://jupyter-aparkin.jupyterhub-prod:15002/;use_ssl=false;x-kbase-token={_tok}'
spark = SparkSession.builder.remote(_url).getOrCreate()
DB = 'enigma_genome_depot_enigma'
tax = spark.sql(f'SELECT taxonomy_id, parent_id, rank, name FROM {DB}.browser_taxon').toPandas()
tax['taxonomy_id'] = tax['taxonomy_id'].astype(str)
tax['parent_id'] = tax['parent_id'].astype(str)
by_tid = tax.set_index('taxonomy_id')
print('taxon rows:', len(tax))
taxon rows: 3597
RANKS = ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
# NCBI uses 'superkingdom'/'domain' inconsistently; accept either at the top
def lineage(tid):
out = {r: None for r in RANKS}
tid = str(tid); seen = set()
for _ in range(40):
if tid not in by_tid.index or tid in seen:
break
seen.add(tid)
row = by_tid.loc[tid]
if isinstance(row, pd.DataFrame):
row = row.iloc[0]
rk = row['rank']
if rk == 'domain':
rk = 'superkingdom'
if rk in out and out[rk] is None:
out[rk] = row['name']
tid = str(row['parent_id'])
return out
uniq = pred[['ncbi_taxid']].drop_duplicates()
uniq['ncbi_taxid_s'] = uniq['ncbi_taxid'].astype('Int64').astype(str)
lin = uniq['ncbi_taxid_s'].map(lambda t: lineage(t)).apply(pd.Series)
lin['ncbi_taxid'] = uniq['ncbi_taxid'].values
print('lineage coverage (non-null):')
for r in RANKS:
print(f' {r:14s} {lin[r].notna().sum():4d}/{len(lin)}')
print('\nsample:')
print(lin[['ncbi_taxid','phylum','class','order','family','genus']].head(8).to_string(index=False))
lineage coverage (non-null):
superkingdom 358/359
phylum 356/359
class 354/359
order 351/359
family 348/359
genus 343/359
species 324/359
sample:
ncbi_taxid phylum class order family genus
1955812 Pseudomonadota Betaproteobacteria Burkholderiales Alcaligenaceae Castellaniella
1822464 Pseudomonadota Betaproteobacteria Burkholderiales Burkholderiaceae Paraburkholderia
1882791 Pseudomonadota Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia
1882792 Pseudomonadota Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia
1144309 Pseudomonadota Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia
1245441 Pseudomonadota Betaproteobacteria Burkholderiales Burkholderiaceae Paraburkholderia
169430 Pseudomonadota Betaproteobacteria Burkholderiales Burkholderiaceae Paraburkholderia
2004485 Pseudomonadota Betaproteobacteria Burkholderiales Comamonadaceae Acidovorax
Strain-level utilizers with lineage + certainty¶
# collapse genomes -> strains (best-scoring genome per compound x strain), like NB04
p = pred.sort_values('score', ascending=False)
strain = (p.groupby(['compound_id', 'name', 'npc_pathway', 'tier', 'ncbi_taxid', 'taxon_name'],
dropna=False)
.agg(score=('score', 'max'),
sig_completeness=('sig_completeness', 'max'),
n_sig_carried=('n_sig_carried', 'max'),
n_genomes=('genome_id', 'nunique')).reset_index())
strain = strain.merge(lin, on='ncbi_taxid', how='left')
def certainty(row):
t2 = row['tier'] == 'T2_pathway'
full = row['sig_completeness'] >= 1.0
if t2 and full: return 'high'
if t2 or full: return 'medium'
return 'low'
strain['certainty'] = strain.apply(certainty, axis=1)
print('strain-level rows:', len(strain), '| distinct strains:', strain['ncbi_taxid'].nunique())
print('\ncertainty x tier:')
print(pd.crosstab(strain['certainty'], strain['tier']).to_string())
strain-level rows: 494 | distinct strains: 359 certainty x tier: tier T2_pathway T3_signature certainty high 64 0 medium 387 43
Phylogenetic placement: compound x taxonomic order¶
The core deliverable-(b) map — for each compound, how its predicted utilizer strains distribute across taxonomic orders, shaded by utilizer count. Orders (not genera) keep the map legible while showing clade structure.
strain['order_lbl'] = strain['order'].fillna(strain['class']).fillna('(unresolved)')
pc = (strain.groupby(['order_lbl', 'name'])['ncbi_taxid'].nunique()
.unstack(fill_value=0))
comp_order = (strain.groupby('name')['ncbi_taxid'].nunique()
.sort_values(ascending=False).index.tolist())
pc = pc[comp_order]
pc = pc.loc[pc.sum(axis=1).sort_values(ascending=False).index]
print('orders x compounds (utilizer strain counts):')
print(pc.to_string())
orders x compounds (utilizer strain counts): name salicylic acid 3-hydroxybenzoic acid 4-hydroxybenzaldehyde phthalic acid TEREPHTHALIC ACID Phenylethylamine xanthine Abscisic acid order_lbl Burkholderiales 60 66 39 7 26 24 1 0 Pseudomonadales 54 18 10 0 0 0 0 0 Sphingomonadales 1 1 47 0 0 0 0 0 Hyphomicrobiales 1 19 4 5 1 1 4 0 (unresolved) 2 4 4 3 2 2 3 0 Micrococcales 0 0 3 10 0 1 0 1 Enterobacterales 8 0 0 0 0 0 2 0 Mycobacteriales 0 0 1 6 0 0 0 0 Bacillales 0 3 0 1 3 0 0 0 Caulobacterales 1 5 0 0 0 0 1 0 Kitasatosporales 0 0 4 3 0 0 0 0 Sphingobacteriales 0 5 0 0 0 0 1 0 Rhodospirillales 0 0 2 0 1 0 0 0 Myxococcales 0 0 1 0 0 0 1 1 Nitrosomonadales 1 0 1 0 0 0 0 0 Lysobacterales 1 0 1 0 0 0 0 0 Cellvibrionales 0 0 2 0 0 0 0 0 Actinomycetes 0 1 0 1 0 0 0 0 Betaproteobacteria 0 1 0 0 1 0 0 0 Chitinophagales 0 2 0 0 0 0 0 0 Cytophagales 0 0 1 0 0 0 0 0 Alteromonadales 0 0 1 0 0 0 0 0 Acetobacterales 0 1 0 0 0 0 0 0 Gammaproteobacteria 0 0 1 0 0 0 0 0 Rhodocyclales 0 0 1 0 0 0 0 0 Rhodobacterales 0 1 0 0 0 0 0 0 Propionibacteriales 0 0 1 0 0 0 0 0 Terriglobales 0 0 1 0 0 0 0 0
fig, ax = plt.subplots(figsize=(9, max(4, 0.34*len(pc))))
im = ax.imshow(pc.values, cmap='YlGnBu', aspect='auto')
ax.set_xticks(range(len(pc.columns)))
ax.set_xticklabels(pc.columns, rotation=45, ha='right', fontsize=8)
ax.set_yticks(range(len(pc.index))); ax.set_yticklabels(pc.index, fontsize=8)
for i in range(len(pc.index)):
for j in range(len(pc.columns)):
v = pc.values[i, j]
if v > 0:
ax.text(j, i, int(v), ha='center', va='center', fontsize=6,
color='white' if v > pc.values.max()*0.5 else 'black')
ax.set_title('Predicted utilizer strains by taxonomic order (deliverable b)')
fig.colorbar(im, ax=ax, shrink=0.6, label='utilizer strains')
fig.tight_layout(); fig.savefig(FIG / '06_phylo_order_map.png', dpi=150)
print('saved 06_phylo_order_map.png'); plt.close(fig)
saved 06_phylo_order_map.png
Per-compound certainty composition¶
For each compound, the certainty mix of its predicted utilizers — the wet-lab view of how trustworthy the call set is.
cc = (strain.groupby(['name', 'certainty'])['ncbi_taxid'].nunique()
.unstack(fill_value=0).reindex(columns=['high', 'medium', 'low'], fill_value=0))
cc = cc.loc[strain.groupby('name')['ncbi_taxid'].nunique().sort_values().index]
print(cc.to_string())
fig, ax = plt.subplots(figsize=(8, 4.5))
cc.plot(kind='barh', stacked=True, ax=ax,
color={'high': '#1a9850', 'medium': '#fee08b', 'low': '#d73027'})
ax.set_xlabel('predicted utilizer strains'); ax.set_ylabel('')
ax.set_title('Per-compound predicted-utilizer certainty composition')
ax.legend(title='certainty', loc='lower right')
fig.tight_layout(); fig.savefig(FIG / '06_certainty_composition.png', dpi=150)
print('saved 06_certainty_composition.png'); plt.close(fig)
certainty high medium low name Abscisic acid 0 2 0 xanthine 0 13 0 Phenylethylamine 0 28 0 TEREPHTHALIC ACID 34 0 0 phthalic acid 0 36 0 4-hydroxybenzaldehyde 10 115 0 3-hydroxybenzoic acid 18 109 0 salicylic acid 2 127 0 saved 06_certainty_composition.png
Write deliverable (b) table¶
out_cols = ['compound_id', 'name', 'npc_pathway', 'tier', 'certainty',
'ncbi_taxid', 'taxon_name', 'superkingdom', 'phylum', 'class',
'order', 'family', 'genus', 'species',
'n_genomes', 'n_sig_carried', 'sig_completeness', 'score']
out = strain[out_cols].sort_values(['name', 'certainty', 'score'],
ascending=[True, True, False]).reset_index(drop=True)
out.to_csv(DATA / 'phylo_utilizer_map.tsv', sep='\t', index=False)
print('wrote data/phylo_utilizer_map.tsv', out.shape)
print('\nsummary: strains per compound x phylum')
print((strain.groupby(['name', 'phylum'])['ncbi_taxid'].nunique()
.unstack(fill_value=0)).to_string())
print('\n=== deliverable (b) headline ===')
print(f'compounds mapped: {strain["name"].nunique()} | utilizer strains placed: '
f'{strain["ncbi_taxid"].nunique()} | high-certainty strain-calls: '
f'{int((strain["certainty"]=="high").sum())}')
wrote data/phylo_utilizer_map.tsv (494, 18) summary: strains per compound x phylum phylum Acidobacteriota Actinomycetota Bacillota Bacteroidota Chloroflexota Myxococcota Pseudomonadota Verrucomicrobiota name 3-hydroxybenzoic acid 0 1 3 7 1 0 112 0 4-hydroxybenzaldehyde 1 9 0 1 1 1 109 1 Abscisic acid 0 1 0 0 0 1 0 0 Phenylethylamine 0 1 0 0 0 0 25 0 TEREPHTHALIC ACID 0 0 3 0 0 0 29 0 phthalic acid 0 20 1 0 0 0 12 0 salicylic acid 0 0 0 0 0 0 127 0 xanthine 0 0 0 1 0 1 8 0 === deliverable (b) headline === compounds mapped: 8 | utilizer strains placed: 359 | high-certainty strain-calls: 64