03 Taxonomy Bridge
Jupyter notebook from the Metal Resistance Ecology: Phylogenetic Conservation vs. Environmental Selection project.
NB03: OTU → Pangenome Taxonomy Bridge¶
Maps MicrobeAtlas OTU lineages to BERDL pangenome species via genus-level taxonomy matching.
Inputs:
data/otu_metadata_raw.csv— all 98,919 OTU rows fromarkinlab_microbeatlas.otu_metadata(downloaded via REST API)data/species_metal_amr.csv— species-level metal AMR gene counts (from NB01, Spark/JupyterHub)data/gtdb_genus_taxonomy.csv— GTDB genus → species mapping (from NB01 Spark output)
Outputs:
data/otu_pangenome_link.csv— OTU → GTDB genus/species link with metal AMR proxydata/otu_taxonomy_parsed.csv— OTU taxonomy parsed into levels
Note: GTDB data requires JupyterHub Spark (see NB01). This notebook runs locally once those CSVs exist.
In [1]:
import pandas as pd
import numpy as np
import re
from pathlib import Path
DATA = Path('../data')
DATA.mkdir(exist_ok=True)
1. Load and parse OTU taxonomy¶
In [2]:
otu = pd.read_csv(DATA / 'otu_metadata_raw.csv')
print(f"OTUs loaded: {len(otu)}")
otu.head(3)
OTUs loaded: 98919
Out[2]:
| otu_id | Tax | n_cells_by_counts | mean_counts | total_counts | |
|---|---|---|---|---|---|
| 0 | 97_29796 | Bacteria;Proteobacteria;Betaproteobacteria;Nit... | 885 | 0.004300 | 1995.0 |
| 1 | 97_36535 | Bacteria;Bacteroidetes;Chitinophagia;Chitinoph... | 443 | 0.002707 | 1256.0 |
| 2 | 97_5082 | Bacteria;Bacteroidetes;Bacteroidia;Bacteroidal... | 2945 | 0.136174 | 63181.0 |
In [3]:
def parse_otu_taxonomy(tax_str):
"""Parse semicolon-delimited OTU taxonomy into a dict.
Format: Kingdom;Phylum;Class;Order;Family;Genus;Species
Handles variable depth (some OTUs only resolved to phylum or class).
Strips bracketed qualifiers like 'Eukaryota[chloroplast]'.
"""
if not isinstance(tax_str, str) or tax_str.strip() == '':
return {}
levels = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
parts = [p.strip() for p in tax_str.split(';')]
result = {}
for i, part in enumerate(parts):
if i >= len(levels):
break
# Strip bracketed qualifiers (e.g. Eukaryota[chloroplast] -> Eukaryota)
clean = re.sub(r'\[.*?\]', '', part).strip()
if clean:
result[levels[i]] = clean
return result
parsed = otu['Tax'].apply(parse_otu_taxonomy).apply(pd.Series)
otu_tax = pd.concat([otu[['otu_id','Tax','n_cells_by_counts','mean_counts','total_counts']], parsed], axis=1)
print(f"Taxonomy depth coverage:")
for level in ['kingdom','phylum','class','order','family','genus','species']:
n = otu_tax[level].notna().sum()
print(f" {level:10s}: {n:6d} ({100*n/len(otu_tax):.1f}%)")
Taxonomy depth coverage: kingdom : 98916 (100.0%) phylum : 67914 (68.7%) class : 66110 (66.8%) order : 64370 (65.1%) family : 41923 (42.4%) genus : 22930 (23.2%) species : 7016 (7.1%)
In [4]:
# Kingdom breakdown
print("Kingdom distribution:")
print(otu_tax['kingdom'].value_counts())
# Flag organellar sequences (16S amplicon artifacts from plants/algae)
otu_tax['is_organellar'] = otu['Tax'].str.contains(r'Eukaryota\[', regex=True, na=False)
print(f"\nOrganellar OTUs (chloroplast/mitochondrion): {otu_tax['is_organellar'].sum()}")
# Unmapped OTU
otu_tax['is_unmapped'] = otu_tax['otu_id'] == 'Unmapped'
print(f"Unmapped bin: {otu_tax['is_unmapped'].sum()}")
Kingdom distribution: kingdom Bacteria 79096 Eukaryota 16562 Archaea 3258 Name: count, dtype: int64 Organellar OTUs (chloroplast/mitochondrion): 1767 Unmapped bin: 1
In [5]:
otu_tax.to_csv(DATA / 'otu_taxonomy_parsed.csv', index=False)
print("Saved data/otu_taxonomy_parsed.csv")
Saved data/otu_taxonomy_parsed.csv
2. Genus-level normalisation¶
The OTU taxonomy uses SILVA-based names; GTDB uses its own genus nomenclature. We match at the genus level (case-insensitive exact match first, then fuzzy if needed).
In [6]:
# Genera present in OTUs
otu_genera = (
otu_tax[otu_tax['genus'].notna() & ~otu_tax['is_organellar'] & ~otu_tax['is_unmapped']]
.groupby('genus')
.agg(
n_otus=('otu_id', 'count'),
total_prevalence=('n_cells_by_counts', 'sum'),
kingdom=('kingdom', 'first')
)
.reset_index()
.sort_values('n_otus', ascending=False)
)
print(f"Unique genera in OTU table: {len(otu_genera)}")
print(f"OTUs with genus annotation: {otu_genera['n_otus'].sum()} / {len(otu_tax)}")
otu_genera.head(10)
Unique genera in OTU table: 3160 OTUs with genus annotation: 22357 / 98919
Out[6]:
| genus | n_otus | total_prevalence | kingdom | |
|---|---|---|---|---|
| 2775 | Streptococcus | 390 | 152648 | Bacteria |
| 375 | Bacillus | 311 | 667707 | Bacteria |
| 2057 | Paenibacillus | 266 | 998707 | Bacteria |
| 2348 | Pseudomonas | 257 | 516244 | Bacteria |
| 2755 | Staphylococcus | 248 | 34169 | Bacteria |
| 704 | Corynebacterium | 229 | 263569 | Bacteria |
| 1045 | Flavobacterium | 229 | 1018421 | Bacteria |
| 661 | Clostridium | 206 | 1030522 | Bacteria |
| 2777 | Streptomyces | 194 | 337733 | Bacteria |
| 432 | Blautia | 190 | 50743 | Bacteria |
3. Load GTDB taxonomy and metal AMR data¶
These files are produced by NB01 (Spark/JupyterHub). If they don't exist yet, run NB01 first.
In [7]:
gtdb_path = DATA / 'gtdb_genus_taxonomy.csv'
amr_path = DATA / 'species_metal_amr.csv'
missing = [p for p in [gtdb_path, amr_path] if not p.exists()]
if missing:
print("⚠ Missing files (requires NB01 Spark output):")
for p in missing:
print(f" {p}")
print("\nProceed to Section 4 for genus-only link (no metal AMR).")
else:
gtdb = pd.read_csv(gtdb_path)
amr = pd.read_csv(amr_path)
print(f"GTDB genera: {gtdb['gtdb_genus'].nunique()}")
print(f"Species with metal AMR data: {len(amr)}, with any AMR: {(amr['n_metal_amr_clusters']>0).sum()}")
GTDB genera: 8419 Species with metal AMR data: 6789, with any AMR: 6789
4. Genus-level join: OTU → GTDB¶
In [8]:
def build_otu_pangenome_link(otu_tax_df, gtdb_df=None, amr_df=None):
"""Build OTU → pangenome link table.
If gtdb_df is None, produces a genus-only link (no pangenome stats).
If amr_df is None, produces link without metal AMR stats.
"""
# Work with genus-annotated, non-organellar, non-unmapped OTUs
df = otu_tax_df[
otu_tax_df['genus'].notna() &
~otu_tax_df['is_organellar'] &
~otu_tax_df['is_unmapped']
].copy()
# Normalise genus name for matching (strip trailing subclade suffixes)
df['genus_clean'] = df['genus'].str.strip()
if gtdb_df is not None:
# Exact case-insensitive match
gtdb_df['genus_lower'] = gtdb_df['gtdb_genus'].str.lower()
df['genus_lower'] = df['genus_clean'].str.lower()
# Merge on lower-case genus
genus_amr = (
amr_df.merge(gtdb_df[['gtdb_species_clade_id','gtdb_genus','genus_lower']],
on='gtdb_species_clade_id', how='left')
.groupby('genus_lower')
.agg(
n_gtdb_species=('gtdb_species_clade_id', 'count'),
n_species_with_metal_amr=('n_metal_amr_clusters', lambda x: (x > 0).sum()),
mean_metal_amr_clusters=('n_metal_amr_clusters', 'mean'),
max_metal_amr_clusters=('n_metal_amr_clusters', 'max'),
mean_n_metal_gene_types=('n_unique_metal_genes', 'mean'),
mean_metal_core_fraction=('n_metal_core', lambda x: (x / amr_df.loc[x.index, 'n_metal_amr_clusters'].replace(0, np.nan)).mean()),
)
.reset_index()
)
df = df.merge(genus_amr, on='genus_lower', how='left')
n_linked = df['n_gtdb_species'].notna().sum()
print(f"OTUs linked to GTDB genus: {n_linked}/{len(df)} ({100*n_linked/len(df):.1f}%)")
else:
print("No GTDB data — genus-only link produced")
cols = ['otu_id','Tax','kingdom','phylum','class','order','family',
'genus','genus_clean','n_cells_by_counts','mean_counts','total_counts']
if gtdb_df is not None:
cols += ['n_gtdb_species','n_species_with_metal_amr','mean_metal_amr_clusters',
'max_metal_amr_clusters','mean_n_metal_gene_types']
return df[[c for c in cols if c in df.columns]]
# Run with whatever data is available
try:
link = build_otu_pangenome_link(otu_tax, gtdb_df=gtdb, amr_df=amr)
except NameError:
# gtdb / amr not loaded yet
link = build_otu_pangenome_link(otu_tax)
link.to_csv(DATA / 'otu_pangenome_link.csv', index=False)
print(f"Saved data/otu_pangenome_link.csv ({len(link)} rows)")
link.head(5)
OTUs linked to GTDB genus: 12438/22357 (55.6%) Saved data/otu_pangenome_link.csv (22357 rows)
Out[8]:
| otu_id | Tax | kingdom | phylum | class | order | family | genus | genus_clean | n_cells_by_counts | mean_counts | total_counts | n_gtdb_species | n_species_with_metal_amr | mean_metal_amr_clusters | max_metal_amr_clusters | mean_n_metal_gene_types | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 97_29796 | Bacteria;Proteobacteria;Betaproteobacteria;Nit... | Bacteria | Proteobacteria | Betaproteobacteria | Nitrosomonadales | Methylophilaceae | Methylobacillus | Methylobacillus | 885 | 0.004300 | 1995.0 | 1.0 | 1.0 | 1.00 | 1.0 | 1.0 |
| 1 | 97_36535 | Bacteria;Bacteroidetes;Chitinophagia;Chitinoph... | Bacteria | Bacteroidetes | Chitinophagia | Chitinophagales | Chitinophagaceae | Segetibacter | Segetibacter | 443 | 0.002707 | 1256.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 97_5082 | Bacteria;Bacteroidetes;Bacteroidia;Bacteroidal... | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | Porphyromonas | 2945 | 0.136174 | 63181.0 | 5.0 | 5.0 | 1.00 | 1.0 | 1.0 |
| 3 | 97_1783 | Bacteria;Firmicutes;Bacilli;Bacillales;Listeri... | Bacteria | Firmicutes | Bacilli | Bacillales | Listeriaceae | Listeria | Listeria | 578 | 0.004701 | 2181.0 | 8.0 | 8.0 | 11.25 | 38.0 | 6.0 |
| 4 | 97_58317 | Bacteria;Actinobacteria;Actinomycetia;Actinomy... | Bacteria | Actinobacteria | Actinomycetia | Actinomycetales | Actinomycetaceae | Bowdeniella | Bowdeniella | 2 | 0.000004 | 2.0 | NaN | NaN | NaN | NaN | NaN |
5. Nitrifier OTU annotation (positive control)¶
In [9]:
# Known nitrifier genera from EDA
NITRIFIER_GENERA = {
# Step 1 AOB (bacterial ammonia oxidisers)
'Nitrosospira': 'AOB_step1',
'Nitrosomonas': 'AOB_step1',
# Step 1 AOA (archaeal ammonia oxidisers — Thaumarchaeota)
'Nitrososphaera': 'AOA_step1',
'Nitrosopumilus': 'AOA_step1',
'Nitrosocosmicus': 'AOA_step1',
# Step 2 NOB (nitrite oxidisers)
'Nitrospira': 'NOB_step2',
'Nitrospina': 'NOB_step2',
'Nitrobacter': 'NOB_step2',
'Nitrococcus': 'NOB_step2',
}
# Annotate the full OTU table
otu_tax['nitrifier_role'] = otu_tax['genus'].map(NITRIFIER_GENERA)
nitrifiers = otu_tax[otu_tax['nitrifier_role'].notna()]
print(f"Nitrifier OTUs annotated: {len(nitrifiers)}")
print(nitrifiers.groupby('nitrifier_role')[['otu_id','n_cells_by_counts','total_counts']]
.agg({'otu_id':'count','n_cells_by_counts':'sum','total_counts':'sum'})
.rename(columns={'otu_id':'n_otus'}))
Nitrifier OTUs annotated: 220
n_otus n_cells_by_counts total_counts
nitrifier_role
AOA_step1 152 675861 119199123.0
AOB_step1 36 295630 12078989.0
NOB_step2 32 547727 23789801.0
In [10]:
# Add nitrifier annotation to link table
link = link.merge(
otu_tax[['otu_id','nitrifier_role']],
on='otu_id', how='left'
)
link.to_csv(DATA / 'otu_pangenome_link.csv', index=False)
print("Updated data/otu_pangenome_link.csv with nitrifier annotations")
# Summary
print(f"\nLink table summary:")
print(f" Total OTU rows: {len(link)}")
print(f" With genus annotation: {link['genus'].notna().sum()}")
print(f" Nitrifier OTUs: {link['nitrifier_role'].notna().sum()}")
if 'n_gtdb_species' in link.columns:
n_amr = (link['mean_metal_amr_clusters'] > 0).sum()
print(f" Linked to GTDB genus with metal AMR: {n_amr}")
Updated data/otu_pangenome_link.csv with nitrifier annotations Link table summary: Total OTU rows: 22357 With genus annotation: 22357 Nitrifier OTUs: 220 Linked to GTDB genus with metal AMR: 12438
6. Coverage diagnostics¶
In [11]:
print("=== Coverage by kingdom ===")
print(otu_tax.groupby('kingdom', dropna=False)['otu_id'].count().sort_values(ascending=False))
print("\n=== Genus resolution by kingdom ===")
for kd in ['Bacteria','Archaea','Eukaryota']:
sub = otu_tax[otu_tax['kingdom'] == kd]
has_genus = sub['genus'].notna().sum()
print(f" {kd}: {has_genus}/{len(sub)} ({100*has_genus/max(len(sub),1):.1f}%) resolved to genus")
print("\n=== Top unmatched genera (high prevalence, no GTDB link) ===")
if 'n_gtdb_species' in link.columns:
unlinked = link[(link['genus'].notna()) & (link['n_gtdb_species'].isna())]
print(unlinked.groupby('genus')['n_cells_by_counts'].sum()
.sort_values(ascending=False).head(10))
=== Coverage by kingdom === kingdom Bacteria 79096 Eukaryota 16562 Archaea 3258 NaN 3 Name: otu_id, dtype: int64 === Genus resolution by kingdom === Bacteria: 19894/79096 (25.2%) resolved to genus Archaea: 1133/3258 (34.8%) resolved to genus Eukaryota: 1903/16562 (11.5%) resolved to genus === Top unmatched genera (high prevalence, no GTDB link) === genus Gaiella 916965 Solirubrobacter 648747 Hyphomicrobium 591327 Stenotrophobacter 556753 Nitrospira 541477 Actinomarinicola 448698 Rhodoplanes 448540 Dongia 441792 Conexibacter 404158 Sporichthya 375370 Name: n_cells_by_counts, dtype: int64