01 Metal Amr Species
Jupyter notebook from the Metal Resistance Ecology: Phylogenetic Conservation vs. Environmental Selection project.
NB01: Metal Resistance Gene Extraction + GTDB Taxonomy¶
REQUIRES JUPYTERHUB — kbase_ke_pangenome is only accessible from JupyterHub notebook credentials.
Outputs (saved to data/ — used by NB03 taxonomy bridge):
data/species_metal_amr.csv— species-level metal AMR gene countsdata/gtdb_genus_taxonomy.csv— GTDB genus → species clade mappingdata/metal_amr_gene_detail.csv— per-cluster AMR gene details
Metal gene list: AMRFinderPlus-annotated genes for Hg, As, Cu, Zn, Cd, Cr, Ni, Co, Ag, Te
In [1]:
# Standard JupyterHub Spark entry point
spark = get_spark_session()
import pandas as pd
from pathlib import Path
DATA = Path('../data')
DATA.mkdir(exist_ok=True)
1. GTDB genus → species clade mapping¶
In [3]:
# Fixed: Calculate counts in a CTE to avoid window function limitations
gtdb = spark.sql("""
WITH genome_counts AS (
SELECT gtdb_species_clade_id, COUNT(DISTINCT genome_id) as clade_genome_count
FROM kbase_ke_pangenome.genome
GROUP BY gtdb_species_clade_id
)
SELECT
sc.gtdb_species_clade_id,
sc.GTDB_species,
regexp_extract(sc.GTDB_taxonomy, 'g__([^;]+)', 1) AS gtdb_genus,
regexp_extract(sc.GTDB_taxonomy, 'p__([^;]+)', 1) AS gtdb_phylum,
gc.clade_genome_count
FROM kbase_ke_pangenome.gtdb_species_clade sc
LEFT JOIN genome_counts gc ON sc.gtdb_species_clade_id = gc.gtdb_species_clade_id
""").toPandas()
print(f"Loaded {len(gtdb)} species clades.")
gtdb.to_csv(DATA / 'gtdb_genus_taxonomy.csv', index=False)
gtdb.head(3)
Loaded 27690 species clades.
Out[3]:
| gtdb_species_clade_id | GTDB_species | gtdb_genus | gtdb_phylum | clade_genome_count | |
|---|---|---|---|---|---|
| 0 | s__Klebsiella_pneumoniae--RS_GCF_000742135.1 | s__Klebsiella_pneumoniae | Klebsiella | Pseudomonadota | 14240 |
| 1 | s__Staphylococcus_aureus--RS_GCF_001027105.1 | s__Staphylococcus_aureus | Staphylococcus | Bacillota | 14526 |
| 2 | s__Salmonella_enterica--RS_GCF_000006945.2 | s__Salmonella_enterica | Salmonella | Pseudomonadota | 11402 |
2. Metal AMR gene extraction from bakta_amr¶
In [4]:
METAL_TYPE_MAP = {
# Mercury
**{g: 'Hg' for g in ['merA','merB','merC','merD','merP','merR','merT','merE']},
# Arsenic & Antimony
**{g: 'As_Sb' for g in ['arsA','arsB','arsC','arsD','arsR','arsH']},
# Copper
**{g: 'Cu' for g in ['copA','copB','copC','copD','copY','copZ','cusA','cusB','cusC','cusF','cusR','cusS']},
# Zinc / Cadmium
**{g: 'ZnCd' for g in ['zntA','zntB','zntR','cadA','cadC']},
# Multi-metal (Cobalt/Zinc/Cadmium)
**{g: 'CoZnCd' for g in ['czcA','czcB','czcC','czcD','czcR','czcS']},
# Chromate
**{g: 'Cr' for g in ['chrA','chrB','chrC','chrF']},
# Nickel
**{g: 'Ni' for g in ['nccA','nccB','nccC','rcnA','nicO']},
# Silver
**{g: 'Ag' for g in ['silA','silB','silC','silP']},
# Tellurite
**{g: 'Te' for g in ['tehA','tehB','tcrB']},
# Lead (New)
**{g: 'Pb' for g in ['pbrA','pbrB','pbrC','pbrR']},
}
METAL_GENES = list(METAL_TYPE_MAP.keys())
gene_in_list = "'" + "','".join(METAL_GENES) + "'"
In [5]:
# Filtering out 'metallo-beta-lactamases' (bla) and other non-resistance metal genes
metal_detail_spark = spark.sql(f"""
SELECT
ba.gene_cluster_id,
ba.amr_gene,
ba.amr_product,
gc.gtdb_species_clade_id,
gc.is_core,
gc.is_auxiliary
FROM kbase_ke_pangenome.bakta_amr ba
JOIN kbase_ke_pangenome.gene_cluster gc ON ba.gene_cluster_id = gc.gene_cluster_id
WHERE (
ba.amr_gene IN ({gene_in_list})
OR LOWER(ba.amr_product) RLIKE 'mercury|arsenic|arsenate|copper|zinc|cadmium|chromate|cobalt|nickel|tellurite|silver|lead'
)
AND LOWER(ba.amr_product) NOT LIKE '%lactamase%'
AND LOWER(ba.amr_product) NOT LIKE '%penicillin%'
""")
# Convert to Pandas after Spark does the heavy filtering
metal_detail = metal_detail_spark.toPandas()
# Helper to categorize genes that were found via text search instead of the gene list
def map_metal(gene, product):
if gene in METAL_TYPE_MAP: return METAL_TYPE_MAP[gene]
product = str(product).lower()
for metal_key, display_name in [('mercur','Hg'), ('arsen','As'), ('copper','Cu'), ('zinc','Zn'), ('cadmium','Cd'), ('chromat','Cr'), ('cobalt','Co'), ('nickel','Ni'), ('tellurit','Te'), ('silver','Ag'), ('lead','Pb')]:
if metal_key in product: return display_name
return 'other'
metal_detail['metal_type'] = metal_detail.apply(lambda x: map_metal(x['amr_gene'], x['amr_product']), axis=1)
metal_detail.to_csv(DATA / 'metal_amr_gene_detail.csv', index=False)
print(f"Validated {len(metal_detail)} metal-related clusters.")
Validated 20823 metal-related clusters.
In [6]:
# Aggregate to species level
amr_species = (
metal_detail
.groupby('gtdb_species_clade_id')
.agg(
n_metal_clusters = ('gene_cluster_id', 'nunique'),
n_metal_types = ('metal_type', 'nunique'),
n_core_metal = ('is_core', 'sum')
)
.reset_index()
)
# Merge with GTDB metadata and counts
amr_final = amr_species.merge(gtdb, on='gtdb_species_clade_id', how='left')
# Normalize: Average clusters per genome in this clade
amr_final['clusters_per_genome'] = amr_final['n_metal_clusters'] / amr_final['clade_genome_count']
amr_final.to_csv(DATA / 'species_metal_amr.csv', index=False)
print(f"Processed {len(amr_final)} species with metal AMR.")
Processed 5640 species with metal AMR.
In [7]:
# Phylum analysis with minimum species threshold
phylum_summary = (
amr_final.groupby('gtdb_phylum')
.agg(
avg_clusters_per_genome = ('clusters_per_genome', 'mean'),
total_species_in_data = ('gtdb_species_clade_id', 'count')
)
.query('total_species_in_data >= 10')
.sort_values('avg_clusters_per_genome', ascending=False)
)
print("Top Phyla by Metal Resistance Intensity (Normalized):")
print(phylum_summary.head(10))
Top Phyla by Metal Resistance Intensity (Normalized):
avg_clusters_per_genome total_species_in_data
gtdb_phylum
Pseudomonadota 0.904523 2632
SZUA-79 0.876190 14
Acidobacteriota 0.862411 47
Dormibacterota 0.839583 12
Gemmatimonadota 0.702668 43
Nitrospirota 0.589319 48
Bacillota_E 0.570728 17
Actinomycetota 0.568053 659
Desulfobacterota_F 0.546091 27
Asgardarchaeota 0.545833 10